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Preface 


Together neuroscience and computing are driving forces for research and innovation. 
They enable new insights into the brain's complexity as well as biological information 
processing and lay ground for progress in future computing. Making use of this collab- 
orative effort by bringing together relevant key players in the field of neuroscience and 
future computing, the workshop on Brain-Inspired Computing (BrainComp) aims to shed 
a light on the digital transformation of neuroscience by high performance computing 
(HPC). 

The International Workshop on Brain-Inspired Computing, held in Cetraro, Italy, 
during July 15-19, 2019, was jointly organized by the Human Brain Project, the 
University of Calabria, the University of Groningen, and the Research Centre Juelich. A 
highlight of this workshop edition was the BrainComp Young Researchers Competition 
in which young researchers were invited to solve neuroscientific problems by using HPC 
resources that were kindly provided by leading European HPC centers. The workshop 
proceedings include contributions from renowned scientists and early career researchers 
who participated in the workshop. It includes research on brain atlasing, multi-scale 
models and simulation, and HPC as well as data infrastructures for neuroscience, as 
well as artificial and natural neural architectures. All submissions were evaluated in a 
single-blind review process. The acceptance rate for BrainComp 2019 was 100%. 
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Abstract. The 'BigBrain' is a high-resolution data set of the human brain that 
enables three-dimensional (3D) analyses with a 20 um spatial resolution at nearly 
cellular level. We use this data set to explore pre-a (cell) islands of layer 2 in 
the entorhinal cortex (EC), which are early affected in Alzheimer's disease and 
have therefore been the focus of research for many years. They appear mostly 
in a round and elongated shape as shown in microscopic studies. Some studies 
suggested that islands may be interconnected based on analyses of their shape 
and size in two-dimensional (2D) space. Here, we characterized morphological 
features (shape, size, and distribution) of pre-o islands in the ‘BigBrain’, based 
on 3D-reconstructions of gapless series of cell-body-stained sections. The EC 
was annotated manually, and a machine-learning tool was trained to identify and 
segment islands with subsequent visualization using high-performance computing 
(HPC). Islands were visualized as 3D surfaces and their geometry was analyzed. 
Their morphology was complex: they appeared to be composed of interconnected 
islands of different types found in 2D histological sections of EC, with various 
shapes in 3D. Differences in the rostral-to-caudal part of EC were identified by 
specific distribution and size of islands, with implications for connectivity and 
function of the EC. 3D compactness analysis found more round and complex 
islands than elongated ones. The present study represents a use case for studying 
large microscopic data sets. It provides reference data for studies, e.g. investigating 
neurodegenerative diseases, where specific alterations in layer 2 were previously 
reported. 


Keywords: Entorhinal cortex - Pre-a islands - ‘BigBrain’ - Machine-learning - 
3D visualization - Large data sets 
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1 Introduction 


The ‘BigBrain’ model represents a three-dimensional (3D)-reconstructed data set at 
histological sections with a spatial resolution of 20 jum isotropic [1]. It is based on 
a series of 7404 images of histological sections, where each was scanned originally 
with an in-plane resolution of 10 jum, down-sampled to 20 jum, resulting in a total 
data size of 1 TByte [1]. This data set and model represents a unique high-resolution 
anatomical reference for applications from neuroimaging, modeling and simulation, 
neurophysiology as well as for research on artificial neuronal networks (for an overview 
of recent research see https://bigbrainproject.org/hiball.html). At the same time, the 
‘BigBrain’ is both a research target and a tool to set up workflows for analyzing large data 
sets, where methods from artificial intelligence (AL), e.g. machine learning, are part of 
advanced workflows running on the supercomputers to advance our understanding of the 
complexity of the brain. Along this line of reasoning, we here address the microstructural 
organization of the human entorhinal cortex (EC). 

The EC is a brain area of the mesial temporal cortex and occupies the rostral part of 
the parahippocampal gyrus (PhG) (Fig. 1). A unique feature of this part of the so-called 
periallocortex is the Substratum dissecans, an almost cell-free sublayer [2—6]. It extends 
parallel to the cortical surface in the midst of the cortex, macroscopically often separating 
EC into an inner and an outer principal layer [3, 5]. The most prominent feature of EC 
is layer 2 (or Stratum stellare [4]), which is composed of neurons clustered into groups 
with different numbers of neurons and different shapes and sizes, the so-called pre-« 
islands [3, 5] (Fig. 1). The surrounding tissue (‘neuropil’) separates the pre-a islands 
from each other. Their morphology can presently only be studied in two-dimensional 
(2D) histological sections of postmortem brain with a spatial resolution that is much 
higher than that of magnetic resonance imaging (MRI). 

Based on histological studies, the shape of pre-a islands was described as elongated 
in the rostral and medial part of EC, and round in the caudal extent [2]. This variability 
of pre-o islands in layer 2 combined with the specific cytoarchitectonic features of other 
layers along the extent of EC led to a partition of this brain area into several subareas 
[2, 7, 8]. For example, Brodmann (1909) [7] identified areas (BA)28a and BA28b, while 
Braak and Braak (1992) [8] defined a medial, lateral, and central subarea. 

Layer 2 and its pre-a islands are involved in the so-called perforant pathway that sup- 
ports the processing of spatial memory and context information [9]. Significant decrease 
of its pre-o cells [10] among other alterations are found in the early stages of Alzheimer's 
disease (AD) and may underline early symptoms such as mild cognitive deficits and 
spatial disorientation. This makes them an interesting research target for the research of 
AD, but also other neurodegenerative and psychiatric diseases (e.g. Parkinson's disease, 
schizophrenia) [8, 11]. 

Previous studies indicated a rather complex morphology of pre-a islands as these 
islands were connected by "bridges" in different types of sections [8, 12] and section 
planes [13]. For example, the latter study noted immunolabeled "bridges" in the inter- 
stices (spaces between two cell islands) that appeared to connect two islands. However, 
these morphological analyses were limited to only one section plane (e.g. by using either 
flat, tangential, or sagittal sections) and were therefore not able to follow the progression 
of pre-a islands in the other planes to fully characterize their shape and size. 
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cma 
round pre-a islands 


Fig. 1. Coronal sections of the human ‘BigBrain’ model (20 jum isotropic resolution) of the EC 
(on the left) and its cytoarchitecture (1 jum in-plane resolution) in its rostral (A), intermediate (B), 
and caudal part (C) (on the right). Dark contours: EC in the left hemisphere, light contour: right 
EC. Note the different types of pre-a islands (elongated, round, and complex) in layer 2 (white 
dotted lines). Section numbers and position in the ‘BigBrain’ are shown in the upper left corner. 
Layers of the EC indicated by Arabic numerals. Abbreviations: AmbG, Ambient gyrus; PAG, 
Parahippocampal gyrus; HIP, Hippocampus. Layers of EC [4]: /, Layer 1 (Stratum moleculare); 2, 
Layer 2 (Stratum stellare); 3, Layer 3 (Stratum pyramidale); 3 diss and 4 diss, Substrata dissecantia; 
4, Layer 4 (Stratum magnocellulare); 5, Layer 5 (Stratum parvocellulare); 6, Layer 6 (Stratum 
multiforme). * in B: clusters of the superficial layer 3. 
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Therefore, the aim of the present study was to combine established and state-of-the- 
art data sets and tools into a workflow for a comprehensive analysis of the morphology of 
layer 2 pre-a islands in their reconstructed 3D environment. Specifically, this study aims 
(1) to characterize the morphology of the pre-a islands of layer 2 and their distribution 
along the longitudinal axis of the EC, and (ii) to measure and analyze distributional 
features of pre-a islands (“gradients’ based on intensity measurements) and their shape 
(‘compactness’) in the human EC. 

Based on the high-resolution 3D ‘BigBrain’ model, the left and right EC were anno- 
tated, followed by the generation of 3D surface meshes in Atelier3D (A3D, National 
Research Council of Canada, Canada, [14]), a software that was optimized to visualize 
and annotate the very large data set of the reconstructed ‘BigBrain’ model. Subsequent 
machine-learning based analysis was used to distinguish between pre-o islands and 
background. Data analysis and visualization of the pre-a islands were performed on 
cropped annotated parts of the whole brain sections of the ‘BigBrain’ [1]. The visualiza- 
tion of the EC with its pre-a islands in the ‘BigBrain’ data set was performed using the 
supercomputer JURECA at Jülich Supercomputing Centre (JSC) [15]. Combined they 
enabled the visualization and in-depth analysis of the pre-o islands, their sizes, shapes, 
and underlying gray values in the reconstructed environment of the complete EC and 
the entire human brain. 


2 Material and Methods 


2.1 Histological Processing and 3D-Reconstruction of ‘BigBrain’ 


In accordance with ethical requirements, a post-mortem human brain of a 65-year old 
male body donor with no medical history of neurological or mental illness was acquired 
during an autopsy performed under the body donor program of the University of Düssel- 
dorf [1] (#4863). The brain was sectioned in the coronal plane (slice thickness: 20 jum) 
and stained for cell bodies using modified silver staining. 7404 sections were acquired 
and digitized with an in-plane resolution of 10 jum, then down-sampled to 20 um and 3D- 
reconstructed, resulting in an isotropic resolution of 20 jum and a data set of about 1 TByte 
[1]. In addition, histological sections were digitized with high throughput bright-field 
microscopes (TissueScope LE120, Huron Digital Pathology), resulting in an in-plane 
resolution of 1 jum, which was used to verify the results at high-resolution. The image 
size for the latter was about 8-10 GByte per brain section image (8bit, bigtif format, 
uncompressed). 


2.2 Border Definition and Annotation of the EC in A3D 


The analysis of EC and adjacent regions was performed in high-resolution (1 jum) sec- 
tions of the ‘BigBrain’ available in MicroDraw (The Institut Pasteur, Paris, France, 
https://github.com/r03ert0/microdraw) (Fig. 2). The cytoarchitectonic criteria for iden- 
tifying the borders were based on previous cytoarchitectonic mappings of EC in ten 
post-mortem brains [16] according to criteria in literature [2, 17-21]. The EC was anno- 
tated in the images of the 3D-reconstructed ‘BigBrain’ (20 jum isotropic resolution) 
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Fig. 2. Sequence and input/output data of the used programs. The rounded boxes indicate software 
tools, while square boxes indicate the main processing steps. Hexagons indicate plug-ins used 
in ImageJ (U.S. National Institutes of Health, USA) [24], while the half-square/angular boxes 
show the results of each process. The respective programs were used online in a browser, on a 
Windows operating system, on a Unix operating system, and in an HPC environment. Border 
identification in Microdraw (The Institut Pasteur, Paris, France, https://github.com/rO3ertO/mic 
rodraw) was performed in sections of the ‘BigBrain’ with an in-plane resolution of 1 um, while 
all other steps were performed in 3D based on the 3D-reconstructed ‘BigBrain’ sections with an 
isotropic resolution of 20 jum. 


[1] using A3D [14] (Fig. 3). A3D was adapted in the framework of the international 
collaboration project *BigBrain' to visualize and annotate the spatial data of the 3D 
reconstructed histological “BigBrain’ data set (https://bigbrainproject.org/hiball.html). 
As a result, 77 annotations in the left hemisphere (ID3, blue/dark) and 92 annotations in 
the right hemisphere (ID2, green/light) were obtained (Figs. 2 and 3). In general, every 
25" section was annotated, but in cases where the 3D geometry, e.g. the thickness and 
total surface area of EC, changed considerably between the gaps of 25 sections, addi- 
tional annotations were performed in every fifth section. A3D was then used to create 
gapless 3D structures from the original 2D annotations. 

This resulted in individual 3D surface meshes of the EC of each hemisphere, com- 
puted in A3D (Fig. 2). The surfaces were checked for artifacts, iteratively adjusted, and 
corrected. The rough edges on the painted surfaces are delicately smoothed locally using 
normalized curvature operators in the normal direction [22, 23]. The smoothing method 
is applied on the 3D surfaces preserving their specific structures. This smoothing retains 
the area of the 3D triangular mesh as well as the volume inside the surface. It also avoids 
moving too far from the drawn points based on thresholds. Subsequently, the 3D Carte- 
sian coordinates of EC surface contours were exported to be processed in MATLAB® 
(MathWorks, USA), with x-coordinates indicating the mediolateral (sagittal) axis, y- 
coordinates the dorsoventral (horizontal) axis, and the z-coordinates the rostrocaudal 
(coronal) axis (Fig. 3). 
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25 mm 


coronal FD 


sagittal 


horizontal 


Fig. 3. Overview of the reconstructed sagittal, original coronal, and reconstructed horizontal 
section planes, as well as of the ‘3D view — volume’ (bottom right) of the ‘BigBrain’ in A3D. EC 
of the left hemisphere dark, right hemisphere light. The EC is visualized as corresponding surfaces 
within the 3D ‘BigBrain’ model, with the surfaces magnified in the cutout for better visualization. 


The coordinates of the surfaces resulted in dense binary masks of the EC and its 
pre-a islands. Furthermore, the coordinates were used to crop smaller parts of the whole 
brain images including EC (Fig. 2). In total, 1065 images for the left and 1063 images for 
the right hemisphere were cropped, resulting in a data set of about 1.2 GByte. Finally, a 
3D median filter (r = 2px) was used to reduce the signal-to-noise ratio and emphasize 
the pre-a island boundaries. 


2.3 Segmentation of Pre-o Islands in Ilastik 


Segmentation of pre-a islands of the left and the right hemisphere was performed sepa- 
rately using the machine-learning based ‘Pixel Classification’ workflow of ilastik (Euro- 
pean Molecular Biology Laboratory, Heidelberg, Germany) [25] (Fig. 2). After loading 
the cropped EC image stacks from A3D (20 jum isotropic resolution) and selecting all 
available descriptive image filters as features, a ‘training’ based on the Random Forest 
classifier was performed using training labeling for the classes background (Fig. 4, light) 
and pre-a islands (Fig. 4, dark). In total, training was performed per hemisphere on six- 
teen coronal (every 50" to 100" section), eight sagittal (every 100" section), and six 
horizontal (every 50" section) cropped EC sections. The 37 features include descriptors 
of pixel intensity (Gaussian Smoothing), edge-ness (Laplacian of Gaussian, Gaussian 
Gradient Magnitude, and Difference of Gaussians), and texture (Structure Tensor Eigen- 
values, Hessian of Gaussian Eigenvalues) including their respective scales (o: 0.3 (only 
Gaussian Smoothing), 0.7, 1, 1.6, 3.5, 5, and 10). The Random Forest classifier uses an 
ensemble of numerous individual decision trees, and here 100 trees were used, with a 
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class prediction made for each tree individually (the class with the most votes being the 
prediction of the model) [25]. After completion of the training, segmented image stacks 
for the left hemisphere (1065 images, 8 bit, 898 x 712 pixels, 641 MByte) and right 
hemisphere (1063 images, 8 bit, 898 x 640 pixels, 583 MByte) of EC were exported 
(Fig. 4). 


segmentation 


Fig. 4. Segmentation of pre-a islands in ilastik (here section number 700, left hemisphere). Train- 
ing was performed with cropped EC images (left image), which were labeled (middle image) for 
the classes background (light) and pre-a islands (dark). The completion of the training resulted in 
segmented images (right image) showing the pre-a islands (dark) and background (light). 


2.4 Analysis of Pre-c« Islands in ImageJ 


ImageJ was used to automatically quantify and describe the properties of the pre-a 
islands segmented with ilastik (Fig. 2). The detected pre-a islands were individually 
labeled and determined by their centroid coordinates. The analysis of the segmented 
pre-a islands in their 3D environment allowed to extract a set of parameters (e.g. 3D 
coordinates, volume, and compactness). 

The compactness (range between 0 and 1) describes the ratio of pre-o islands volume 
and the smallest possible surrounding sphere including the individual pre-a islands. A 
value of 1 corresponds to a sphere and lower values to more complex structures. 

Additionally, mean, mode, and standard deviation of the gray values were taken 
from the original *'BigBrain' images (8 bit, 20 um resolution, unprocessed) at the cor- 
responding positions of the segmented island. Cell islands that were smaller than five 
voxels (voxel size 20 um isotropic) were removed based on the definition of islands as 
structures of three or more clustered neurons. 

Finally, the mean gray values were inverted (x(gray value) — 256-value), which 
resulted in lower and higher mean values corresponding to lower and higher cell-packing 
densities, respectively. Gradients of these gray values were plotted along the three main 
axes of the brain. Pearson correlations were computed including squared correlation 
coefficients, and the correlations were analyzed for significance. To exclude a possible 
bias of the gray level gradients due to the histological staining or the global correction 
of the intensities of the 'BigBrain' data set, the gray levels of the entire entorhinal cortex 
without the pre-c-islands were additionally determined. The pre-o islands of the cropped 
EC images were 3D visualized using the ImageJ ‘3D viewer’ plug-in. 
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2.5 Visualization of the EC and the Included Pre-c Islands in the Context 
of the Entire ‘BigBrain’ Data Set 


To visualize the EC and the included pre-o islands in the context of the entire “BigBrain’ 
data set, 2D segmentation images of both structures were stacked along the rostrocaudal 
axis to create 3D volumes (Fig. 2). These were subsequently converted to a custom data 
format which can be displayed in the interactive 3D atlas viewer of the Human Brain 
Project [26] hosted by the EBRAINS infrastructure (https://ebrains.eu/). 

In addition, 3D triangular surface meshes (left hemisphere (ID3): 1886051 points 
and 3781098 triangles for islands, 2854912 points and 5709820 triangles for EC; right 
hemisphere (ID2): 1680293 points and 3369782 triangles for islands, 2743662 points and 
5487320 triangles for EC) were extracted from the volumes using the marching cubes 
algorithm [27] to visualize the 3D appearance of the structures. Due to the large size 
of the data set of the EC in the whole BigBrain context (~37 GByte stack size, 6572 x 
1064 x 5711 voxels), which did not allow for post-processing on classical desktop PCs, 
surface extraction was performed by subdividing the volumes into 3D chunks, which 
were then processed in parallel using the supercomputer system JURECA [15] at Jülich 
Supercomputing Centre (JSC). 


3 Results 


3.1 Overview of the Layers of the EC 


The EC consists of six layers and almost cell free Substrata dissecantia (3 diss (external 
layer), 4 diss (internal layer) Figs. 1B, 5 and 6). Layer 2 harbored large stellate cells 
(modified pyramidal cells), forming pre-a islands of different size and shape, separated 
by neuropil. 

In layer 3, pyramidal cells clustered into columns in the intermediate part of EC 
(Fig. 1B). The Substrata dissecantia continued in parallel to the pial surface: The rostral 
part of EC revealed only the external Substratum dissecans (3 diss) (Figs. 5 and 6), 
whereas the intermediate and caudal portions of EC showed both 3 diss and 4 diss 
(Fig. 1B-C). Layer 4 showed large and darkly stained neurons and was clearly visible 
along the rostrocaudal extent of EC. No clear border was visible between layer 5 and 
layer 6 as they often seemed to intermingle (Fig. 1B-C). The border between layer 6 and 
the white matter was distinct only in the intermediate to caudal parts of EC (Fig. 1B-C), 
but not rostrally. 

Layer 2, Substrata dissecantia, and layer 4 of EC were specific to EC, but not to 
the neighboring structures. EC has borders with the amygdalopiriform transition area 
(APir), hippocampal-amygdaloid transition area (HATA), the parasubiculum (PaS), and 
BA35a (Figs. 5 and 6). 
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Fig. 6. 3D surface model of EC of the right hemisphere (designation as above for Fig. 5). *, 
intrarhinal sulcus [28] 


3.2 Cytoarchitecture of Layer 2 (Pre-o Islands) and Modifications Along 
the Rostrocaudal Extent 


The rostral part of EC showed the smallest (often narrow) pre-a islands separated from 
each other by little gaps (Fig. 1 A). Neurons within the islands were more densely packed 
than in the rest of the EC. In the intermediate part of EC, pre-o islands were characterized 
by the largest size and the largest gap from the neighboring islands (Fig. 1 B). In the caudal 
part of EC, pre-a islands were the most variable in their density and size (Fig. 1C). The 
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three parts of EC differed from each other by the type of the predominantly found islands: 
small and round islands were found in the rostral part (which also showed a few narrow, 
elongated islands, Fig. 1 A), elongated in the intermediate part, and round in the caudal 
part. 


3.3 Surface and Morphological Features of Pre-w Islands in EC 


Annotations for each hemisphere were stacked, smoothed while keeping the volume 
constant and visualized as 3D surfaces in A3D (Figs. 2, 5 and 6). On both sites, the 
EC shows the broadest extent in its intermediate part, while it becomes more compact 
in the most rostral and most caudal parts (Figs. 5 and 6). The peek in the intermediate 
part of EC belongs to the AmbG that is ventrally limited from the PhcG of EC by the 
intrarhinal sulcus [28]. The intrarhinal sulcus is clearly visible in the right hemisphere, 
where it can also be identified in the coronal sections (see asterisk in Fig. 6), while in the 
left hemisphere it was shallow (Fig. 5). The cortex of the left hemisphere was slightly 
damaged during the sectioning process of the paraffine embedded brain (see asterisk in 
Fig. 5). Therefore, an elevation of the surface could be observed in the 3D model (Fig. 5). 
This did not, however, affect the quality of the annotations and surface extractions of 
EC, as well as subsequent segmentation and analysis of pre-a islands (Fig. 7). 


Dorsal Dorsal 


Caudal Rostral $ ek Rostral Caudal 


Ventral 


Fig. 7. Visualization of pre-o island surfaces in the EC: left and right hemispheres. The asterisks 
in the left hemisphere indicate regions with cutting artifacts: *, small cuts in the cortex; **, artifact 
of the digitally repaired 3D reconstruction of ‘BigBrain’; ***, incised and downward protruding 
cortex. Abbreviation: AmbG, ambient gyrus 


The 3D models of pre-a islands revealed a variety of different shapes. Three main 
groups were identified according to their measured compactness: round, elongated, and 
complex islands (Figs. 7 and 9). The comparison of the various shapes of the complex 
islands in 3D and the shapes seen in 2D histological sections indicated that the complex 
islands were composed of interconnected round and elongated ones that compose layer 
2 of EC. 
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The pre-a island surfaces showed further structural differentiation (based on dis- 
tribution, size, and shape of the islands) between the rostral, intermediate, and caudal 
parts of EC (Fig. 7). The rostral part showed the smallest and most densely packed pre-a 
islands of predominantly elongated and round shape. The intermediate part of EC was 
less densely packed than in its rostral part and revealed the largest islands with the widest 
gaps between them. In addition to elongated and round islands, the intermediate part 
showed the largest number of complex islands. The caudal part of EC included round 
and complex islands, which were larger and less densely packed than in the rostral part 
of EC, but more densely packed and smaller than in its intermediate part. 


3.4 Number and Distribution of Pre-co Islands 


In total, 2045 pre-a islands in the left and 1841 pre-a islands in the right hemisphere 
were found. In the following, a gradient of pre-o islands corresponds to the mean gray 
values for each pre-a islands as a correlate of their cell-packing density (Fig. 8). 

The medial versus lateral gradient of pre-a islands indicated that the cell-packing 
density was higher in the lateral than in the medial EC, for both the left (slope — 0.0058; 
R? = 0.0158) and right hemispheres (slope = —0.018; R? = 0.12). 

Regarding the dorsal to ventral gradient of pre-a islands, a slight increase in the 
cell-packing density was observed in the left hemisphere (slope = 0.0198; R? = 0.0741) 
and a minimal decrease in the right hemisphere (slope = —0.006; R? = 0.0075). 

Finally, the gradient from caudal to rostral showed a slight decrease in cell-packing 
density in both the left (slope = —0.0018; R? = 0.0015) and right (slope = —0.0049; 
R? = 0.0176) hemispheres. All correlations were significant (p < 0.05), except the 
correlation along the dorsal-ventral axis of the left hemisphere, which did not reach 
significance (p = 0.0840). 

The analysis of compactness showed that the morphology of pre-a islands varied 
between elongated, complex, and round, whereby the transitions between them were 
rather smooth (Fig. 9). Moreover, the histograms were left-skewed, indicating that 
there were more round and complex than elongated islands in EC of both hemispheres. 
Finally, the distribution of the different types of islands was highly comparable in both 
hemispheres. 

The data are publicly available as part of the multilevel human brain atlas in the 
EBRAINS infrastructure and can be explored interactively in the EBRAINS 3D atlas 
viewer (https://atlases.ebrains.eu/viewer). 
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Fig. 8. Gradients of pre-a islands in EC, represented as mean gray values per pre-a islands, as 
a correlate of the cell-packing density within each island. Each ball (dark gray, left hemisphere; 
light gray, right hemisphere) represents a 3D centroid, thus a pre-o island. Medial versus lateral 
gradient: Note the difference between the hemispheres. Since the surfaces for each hemisphere 
were extracted separately and the values are sorted by their x-coordinates, the mean gray values 
in the left hemisphere are indicated from lateral to medial (outside to inside), while in the right 
hemisphere they are indicated from medial to lateral (inside to outside). Dorsal to ventral gradient: 
Note a longer line related to the left hemisphere, which is a reflection of an incised and downwards 
protruding cortex in the intermediate and caudal parts of the left EC (see asterisk in Figs. 5 and 
7). Caudal to rostral gradient: Note that the z-coordinate goes from the most caudal (posterior) 
to the most rostral (anterior) coordinate. 
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Fig. 9. Histogram indicating the frequency of compactness (from 0.12 to 1 in steps of 0.04) of 
pre-o islands in the left hemisphere (in total 2045 pre-o islands; upper histogram) and in the 
right hemisphere (in total 1841 pre-a islands; lower histogram) of the ‘BigBrain’ EC. For the left 
hemisphere, three examples are given for the different types of pre-a islands: one elongated (range 
of compactness (0.32, 0.36]), one complex (range of compactness (0.60, 0.64]) and one round 
(range of compactness (0.92, 0.96]) island. 
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4 Discussion 


Based on the high-resolution data set of the 3D-reconstructed ‘BigBrain’ [1], we created 
a workflow that allowed a comprehensive analysis of the individual morphology and 
distribution of pre-a islands in layer 2 of the EC, by using existing and new software 
tools. The visualization and annotation software A3D was used to annotate the EC in 
both hemispheres, followed by the preparation of cropped EC images in MATLAB®. 
The machine-learning based ‘Pixel Classification’ tool of ilastik [25] was trained to 
identify and segment pre-a islands. ImageJ [24] was used to visualize the 3D surface 
of the segmented pre-a islands and to analyze individual pre-a islands in their spatial 
extent, including intensity and geometric measurements. Finally, whole brain section 
masks of the EC and the included pre-a islands were used to visualize the results in 
its native 3D environment employing marching cubes algorithm at the supercomputer 
system JURECA [15] at Jülich Supercomputing Centre (JSC). The required programs 
and the input and output files generated in each case are shown as a flowchart in Fig. 2. 

The applications shown here using the ‘BigBrain’ data set enable a better understand- 
ing of the morphology of pre-o islands and a description of their underlying cytoarchi- 
tecture. Although pre-a islands exist only in the EC, there are other brain areas that show 
structures of distinct morphology. For example, the cell clusters of the amygdalopiriform 
transition area, for which our workflow, i.e. the annotation in A3D, labeling of structures 
in ilastik, and their 3D visualization in ImageJ, could be applied or adapted. 

By focusing on the morphological characteristics of pre-a islands in the ‘BigBrain’ 
model, numerous islands of a specific shape were analyzed. Compactness analysis, per- 
formed for the first time at high-resolution in 3D space, revealed round and complex 
islands as the predominant island types. In addition, 3D characterization of pre-a islands 
using the ‘BigBrain’ model allowed the identification of differences based on the dis- 
tribution, size, and shape of islands in the rostral, intermediate, and caudal part of EC. 
Here, the largest number of complex islands was observed in the intermediate part of EC. 
Previous approaches indicated the complexity of the pre-a islands by reporting “bridges” 
[8, 12, 13]. However, the analysis of complex islands based on 2D images is complicated 
and error-prone because they do not allow following islands between sectional planes. 
Thus, classical histological 2D analysis can only provide a limited description of the 3D 
shape and size of pre-a islands. As a result, the number of complex islands observed in 
2D sections is lower than the number obtained from 3D analyses. 

Regarding interhemispheric differences in the layer 2 characteristics, the present 
study found more pre-a islands in the left than in the right hemisphere by comparing the 
numbers of 3D centroids (i.e. pre-a islands) that were exported from ImageJ. Although 
this result is based on only one human brain, it is supported by a previous study that 
found a higher number of pre-a neurons in the left than in the right EC, in 18 out of 22 
human brains analyzed [21]. 

To our knowledge, density gradients based on intensity measurements, in particular, 
gray values corresponding to the cell packing densities, of all pre-a islands embedded in 
layer 2 of EC have not been discussed or raised in previous studies. Nevertheless, pre- 
vious cytoarchitectonic and immunohistochemical studies are in support of the present 
results on the rostral to caudal increase of cell packing density. Heinsen et al., 1994 
[21], observed such a gradient on the basis of sections Nissl-stained with gallocyanin, 
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whereas Beall and Lewis (1992) [29] described increases towards the caudal extent of EC 
based on light transmittance of phosphorylated neurofilament protein immunoreactivi- 
ties. Interestingly, anatomical and electrophysiological studies regarding the hippocam- 
pus provided evidence for a gradient along its dorsoventral axis in rodents, while gene 
expression studies indicated distinct functional domains with often clearly demarcated 
borders [30]. These observations led to the conclusion that the well-known view that 
spatial navigation and memory, and thus cognitive functions, are mediated in the dorsal 
(or posterior) parts of the hippocampus, whereas emotional responses are mediated in 
its ventral (or anterior) part, should be reconsidered. It is now proposed that hippocam- 
pal functional gradients may be superimposed on distinct functional domains at both 
the anatomical and mRNA levels [30]. Regarding the functional topography of the EC, 
Navarro Schróder et al., 2015 [31], performed a high-field functional MRI study (at 7 T) 
in which two groups of independent participants were presented with either images of 
scenes (for spatial stimuli) or images of objects (for non-spatial stimuli). They showed 
evidence of functional division along the anteroposterior axis, as the anterior EC showed 
higher responses to non-spatial stimuli and the posterior EC showed higher responses to 
spatial stimuli. It would be interesting to investigate whether there is a functional gradi- 
ent along the rostrocaudal (anteroposterior) axis of EC that correlates with the density 
gradients we found in the present study. Finally, whether these then correlate with the 
proposed gradients along the hippocampal long axis. 

Previous cytoarchitectonic studies subdivided the EC based on the differences found 
in all layers (into two subareas in Brodmann (1909) [7] to eight subareas in Insausti 
et al., 1995, 2017 [2, 17]). However, the present study describes the heterogeneity of 
pre-a islands at the level confined to one layer, layer 2. In the intermediate and caudal 
parts of EC, we identified conspicuous islands of different shapes (our complex islands) 
in addition to the island types described by Insausti et al., 1995 [2]. These islands were 
not clearly separated from each other and appeared to be composed of interconnected 
islands of other types, resulting in a complex morphology. Braak and Braak (1992) [8] 
identified three major subareas (medial, lateral, and central), based on specific features 
of entorhinal layers in different parts of EC. The authors considered bipartition of pre- 
a and pre-B layer (layer 2 and 3 of Zilles (1987) [4]) as a criterion for the medial 
subarea which covered the ambient gyrus. Moreover, a characteristic tripartition of the 
pri-a layer (layer 4 of Zilles (1987) [4]) marked their central subarea, which covered 
a part of the parahippocampal gyrus [8]. In contrast, based on the distribution of pre-a 
islands, the present study suggests that the ambient gyrus partly comprises the rostral 
and intermediate parts of EC. 

However, since only one ‘BigBrain’ was analyzed as a use case in the present study, 
but 3D-reconstructions of two other 'BigBrains' are currently under development, we 
intend to apply our workflow to these *BigBrain' models. This will facilitate visualization 
of pre-a island surfaces and further descriptive analyses, but could also enable quan- 
titative analysis of possible EC subareas. Further application of the method presented 
here to other high-resolution brain models will also allow continued investigation of the 
interindividual variability of pre-a islands at the highest resolution level. 

Taken together, the present study demonstrated a new workflow for data analysis and 
visualization of EC and its embedded pre-a islands using machine-learning based and 
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image data analysis tools. With these tools, we demonstrated new characteristics of the 
human EC and pre-a islands embedded therein, in particular the existence of complex 
islands and the rostral to caudal cell-packing density gradient. As pre-a cells of EC are 
involved in the functionally important perforant pathway, it becomes clear why it is so 
important to understand the natural structure of EC in order to detect possible changes 
during aging and in neuro-/psychiatric diseases. The results of this study will be made 
available to the neuroscientific community via the research infrastructure EBRAINS and 
will be linked to the multilevel human brain atlas of EBRAINS. The representation of the 
surfaces of the EC and the islands contained therein is of particular interest to a broad 
readership. The first-time visualization of these microscopic structures in the context 
of the entire brain represents an important bridge between basic research and possible 
medical applications, as it is common in medicine to view the brain in its natural three- 
dimensional environment. Our current results could provide a reference model for future 
studies of neurodegenerative and psychiatric diseases such as Alzheimer's, Parkinson's, 
and schizophrenia. 
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Abstract. The human lateral geniculate body (LGB) with its six sickle shaped 
layers (lam) represents the principal thalamic relay nucleus for the visual system. 
Cytoarchitectonic analysis serves as the groundtruth for multimodal approaches 
and studies exploring its function. This technique, however, requires experienced 
knowledge about human neuroanatomy and is costly in terms of time. Here we 
mapped the six layers of the LGB manually in serial, histological sections of the 
BigBrain, a high-resolution model of the human brain, whereby their extent was 
manually labeled in every 301P section in both hemispheres. These maps were 
then used to train a deep learning algorithm in order to predict the borders on 
sections in-between these sections. These delineations needed to be performed 
in 1 jum scans of the tissue sections, for which no exact cross-section alignment 
is available. Due to the size and number of analyzed sections, this requires to 
employ high-performance computing. Based on the serial section delineations, 
high-resolution 3D reconstruction was performed at 20 jum isotropic resolution 
of the BigBrain model. The 3D reconstruction shows the shape of the human 
LGB and its sublayers for the first time at cellular precision. It represents a use 
case to study other complex structures, to visualize their shape and relationship to 
neighboring structures. Finally, our results could provide reference data ofthe LGB 
for modeling and simulation to investigate the dynamics of signal transduction in 
the visual system. 


Keywords: Lateral geniculate body (LGB) - Corpus geniculatum laterale 
(CGL) - BigBrain - Deep learning - 3D reconstruction - Cytoarchitecture 


1 Introduction 


The lateral geniculate body (LGB, /at. Corpus geniculatum laterale, from now on LGB) 
plays a key role in visual perception. Together with the medial geniculate body, which 
is involved in auditive processing, both nuclei constitute the metathalamus. The LGB is 
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located on the ventral surface of the brain. It mainly receives connections from the retina 
via the optic tract, but also from layer 6 of the visual cortex and the reticular nucleus 
of the thalamus [1]. The most prominent efferent projections reach the primary visual 
cortex, i.e. Brodmann’s area 17 (or area V1, or hOc1 [2]; via the optic radiation (Fig. 1 
top image) [1]. The human LGB consists of six layers. The two most ventrally located 
layers (layers 1 and 2) consist of larger neurons and are known as magnocellular, while 
layers 3 to 6 are parvocellular layers (Fig. 1 bottom image). Koniocellular neurons are 
located in between those laminae. 


Layers 2, 3 and 5 receive fibers from 
Optic chiasm 


ca. the ipsilateral eye, whereas layers 1, 4 and 6 

ater F 

geniculate priman visia receive fibers from the contralateral eye [3]. 
cortex I.e., each layer receives information from 


one eye only. Later on, the information will 
be merged to be processed and interpreted 
as a binocular image in the visual cortex [4]. 
Approximately 80% of the retinal informa- 
tion derive from midget ganglion cells and 
are transferred to the parvocellular neurons 
in the LGB in layers 3 to 6. These small 
neurons are specialized in object and detail 
recognition due to their ability of generating 

lateral a high spatial resolution and red-green color 
rostra | caudal vision [4]. Midget cells are characterized 
as small, color-sensitive slow adapting cells 
from the retina, contrary to parasol cells. 
Retinal parasol cells send impulses to the 
bigger magnocellular layers 1 and 2, which 
are functional for time resolution and hence 
for the perception of position and movement 
[5]. Non-Midget-non-Parasol ganglion cells 
from the retina project to koniocellular neu- 
rons of the LGB, which further project to the 
primary visual cortex, similarly to the parvo- 
and magnocellular systems. Since the LGB 
transfers retinal information directly to the 
Fig. 1. The human visual pathway (top primary visual cortex via the optic radiation, 
image). The location of the lateral geniculate it is also defined as first-order relay [1]. The 
body (LGB) is marked by the black rectangle. koniocellular neurons most probably play a 


Six layers of the LGB (bottom image) describ- role in color perception [4]. The middle part 
ing contralateral projections depicted in green of the LGB in coronal sections is called the 


and ipsilateral projections in red. Created with hilum, and the taperings at the medial and 
BioRender.com (Color figure online) lateral ends of the LGB are called the medial 
and lateral horn of the LGB, respectively [6]. 

Lesions in the LGB can affect the function of the visual pathway. For example, 
patients suffering from multiple sclerosis or Alzheimer's disease show a general volume 


lateral 


123456 
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loss of the LGB, indicating in the latter case a significant relation between degeneration 
of the LGB and amyloid-£ pathology [7, 8]. After loss of visual experience due to an 
injury of the primary visual cortex but with an intact LGB, certain visual information 
processing still exists. This phenomenon discovered by functional magnetic resonance 
imaging (fMRI) is called blindsight [9]. 

This work aims to map the LGB and its layers at microscopical resolution, to provide 
reference data for studies targeting the LGB in the living human brain, to develop a use 
case enabling the combination of expert annotations based on cytoarchitectonic criteria 
in a subset of sections with deep learning in order to increase the number of delineated 
sections. A 3D reconstruction of the human LGB in both hemispheres of the BigBrain 
dataset was computed and visualized [10]. 


2 Materials and Methods 


2. Histology 


The BigBrain (65-year-old male) comes from the body donor program of the Anatomical 
Institute of Düsseldorf in accordance to legal and ethical requirements. Prior to histo- 
logical processing MRI images were taken (1.5 T, Siemens Medical Systems GmbH, 
Erlangen, Germany) in order to provide a reference volume for subsequent image regis- 
tration. Histological processing was previously described in detail [10, 11]. In short, the 
brain was fixed in 496 buffered formalin, embedded in paraffin and cut in coronal plane, 
each Sect. 20 jum thick. Every section — of total 7404 — was mounted and silver stained 
for cell bodies. The sections were digitized using a TISSUEscope™ LE120 Scanner 
(Huron Digital Pathology) [10]. The spatial resolution of the images is 1 jum in-plane, 
the average size of the images is 10 GByte per section. 


2.2 Manual Analysis and Reference Mapping of Histological Sections 


High-resolution images were analyzed and manually delineated using SectionTracer, an 
online tool written in JavaScript [12]. Borders of the LGB and its six layers were traced 
in 16 sections per hemisphere, i.e. every 30" image at a distance of 600 jum. 

The human LGB is surrounded by white matter, i.e. fiber tracts, and could therefore 
be clearly distinguished from its neighboring structures: the thalamus was found dorso- 
medially from the LGB while the medial geniculate body was located medially. At rostral 
levels the LGB was completely surrounded by white matter. 

The borders of the six layers of the LGB were identified based on differences in the 
cytoarchitecture (Fig. 2). Magno- and parvocellular layers were mainly distinguished 
according to their cell-size and -density. The pale koniocellular neurons served as main 
indicators for the borders between the different layers of the LGB. Wherever these criteria 
were not sufficient, i.e. where two parvocellular layers were not separated by a konio- 
cellular lamina, cytoarchitectonic criteria such as size, shape, density and distribution of 
neurons were applied. Borders were drawn where the cytoarchitectonic pattern changed. 
Layers were numbered according to their position, starting from the most ventral layer 
1 at the brain surface, increasing to the uppermost dorsal layer 6. 
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Fig. 2. Shape and borders of the six layers (labeled by different colors and numbers) of the human 
lateral geniculate body (LGB) on coronal sections. Upper row shows a comparison between manual 
expert mapping (a) and the prediction of the CNN (b). The location of the LGB on the coronal 
section is shown in (c). One caudal (d) and one rostral section (e) indicate different shapes of the 
LGB at different levels of sectioning. Scale bars: a-b, c-e 1 mm, c 20 mm. Scalebar magnification 
in a: 50 jum. 


The volumes of the LGB and its layers were measured using Cavalieri's principle. 
A shrinkage factor of 1,931 was used for calculations in order to consider the shrinkage 
occurring during histological processing, the fixation, respectively [13]. 

The layers were then 3D reconstructed and transferred to the BigBrain space, which 
has a spatial resolution of 20 jum isotropic [12]. Results were visualized using the 
software ParaView [14, 15]. 


2.3 Training of the Deep-Learning Algorithm to Predict Missing Delineations 


A convolutional neural network (CNN) was trained to classify each image pixel accord- 
ing to each lamina of the LGB. Network architecture and training procedure were 
based on Schiffer et al. [16], as they have been used successfully to aid mapping of 
cytoarchitectonic areas in several cortical brain regions [17—20]. 

The workflow uses a U-shape architecture for brain area segmentation with two 
encoder branches capturing the input data at different spatial scales. Training and pre- 
diction was controlled by the web-based interface of the tool, and performed remotely on 
the supercomputer JURECA [21] at Jülich Supercomputing Centre (JSC). The training 
time on the HPC system was around 70 min. 
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After training, the CNN was applied to automatically classify layers of LGB in 
sections without annotations. Automatically created annotations were quality-checked to 
exclude misclassified sections. Annotations were then non-linearly transformed [22] into 
the 3D reconstructed BigBrain space [10]. Previously excluded sections were replaced 
by interpolation [23]. Finally, the marching cubes algorithm [24] was applied to extract 
a surface mesh for each layer of LGB. This 3D reconstruction step is not part of the 
tool provided by [16], but it follows directly the experimental protocol for the BigBrain 
dataset described therein. 


3 Results 


3.1 Cytoarchitectonic Mapping Based on Expert Annotations and Deep Learning 


The analysis of the LGB allowed to identify six layers, partially different in shape and 
size (Fig. 2). The two ventral layers contained magnocellular neurons of triangular and 
multiform shapes and differed significantly in the cytoarchitecture from the parvocellular 
layers 3-6. Layers 1 and 2 were thinner and more elongated; contained round and oval 
shaped neurons. Layers 3 and 6 were most prominent and were reached over a larger 
distance than the other two layers of the parvocellular part. Layer 5 was the shortest and 
least developed layer with respect to its mediolateral extent. On most sections, the layers 
were separated by a white koniocellular line, characterized by a low cell density. On a 
few sections, layers 1 and 4 as well as 4 and 6 were connected. The same was true for 
layers 2 and 3 as well as 3 and 5, where the koniocellular lines were lacking (Fig. 2a, 
black arrow). Although layers 1 and 2 were the thinnest structures, the magnocellular 
neurons are clearly visible, due to the magnocellular neurons (Fig. 2a, inset). Layers 3 
to 6, on the other hand, were composed of smaller and more densely packed cells. The 
neurons on the lateral and medial horn of the LGB in each layer were loosely packed, 
while the neurons at the hilum were denser. 

In total, 13 sections were labeled by an expert in the left hemisphere and 11 sections 
in the right hemisphere, with a distance of 0.6 mm. The LGB of the left hemisphere 
was found and processed by the CNN on 366 sections (rostro-caudal extent 7.3 mm) 
and on 293 coronal sections (extent of 5.9 mm) of the right hemisphere. For comparison 
between the expert annotation and the prediction of the CNN see Fig. 2a and b. 
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3.22 High-Resolution 3D Reconstruction 


The exact location of the human LGB is depicted in the BigBrain model in Fig. 3a. 
A high-resolution 3D reconstruction was performed to get a deeper insight into the 
complex shape of the human LGB. The prominent hilum and the elongated lateral horn 
are characteristic for the shape of the human LGB (Fig. 3b). In more detail, the dorsal 
surface of the LGB is mainly covered by layers 4, 5 and 6. Layer 4 is more prominent 
at the hilum and the medial horn (Fig. 3b), while layer 3 is most prominent at the lateral 
horn (Fig. 3c). Most parts of the ventral surface of the LGB are covered by layer 1 
medially and by layer 2 laterally (Fig. 3c). 


3.3 Volumes of Layers 


The total volumes of the LGB, after consideration of the shrinkage factor, are 120.5 mm? 
on the left, and 113.2 mm? on the right hemisphere (Table 1). The parvocellular layers are 
bigger than the magnocellular layers. The total volume of the LGB is 141.2 on the left, 
and 132.2 mm? on the right hemisphere. It contains the six layers, but also koniocellular 
layers and blood vessels, which are not included in the volumes of the single layers. 


Table 1. Volumes (in mm?) of each of the layer of the lateral geniculate body (LGB) in both 
hemispheres, sum of layers 1—6, and total volume. 


LGB Left [mm?] | Right [mm?] 
Layer 1 10.2 7.8 
Layer 2 8.3 7.4 
Layer 3 28.5 27.4 
Layer 4 25.9 21.3 
Layer 5 20.2 21.8 
Layer 6 27.4 27.5 
X of layers 1-6 | 120.5 113.2 
Total 141.2 132.2 
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Fig. 3. 3D reconstruction of the lateral geniculate body (LGB). (a) Localization of the left lateral 
geniculate body (LGB) in the BigBrain model. (b-c) Surface of the reconstructed left and right 
LGB revealing its specific shape and its different layers. (b) dorso-caudal view (c) ventro-rostral 
view. Abbreviations: d = dorsal, 1 = lateral, m = medial, v = ventral. 
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4 Discussion and Conclusion 


This study presents the first application for mapping a complex subcortical gray matter 
structure, the LGB, combining an expert-based and a deep-learning approach, resulting 
in a high-resolution 3D reconstruction of the LGB in the BigBrain template. It shows, 
for the first time, the shape of the individual six layers in 3D space while previous 
information was mainly obtained from 2D sections. The maps are publicly available 
in the EBRAINS multilevel human brain atlas (https://ebrains.eu/service/human-brain- 
atlas), where they can be explored in an interactive 3D viewer (https://interactive-vie 
wer.apps.hbp.eu/saneUrl/BigBrain LGB) and can be found under the DOI: https://doi. 
org/10.25493/33Z0-BX. 

Due to the large size of the human brain, subcortical nuclei such as the LGB and 
cortical areas are found in tens to hundreds to thousands of sections, in dependence on 
the size of the structure. To map a structure manually in every section using traditional 
methods becomes impossible when structures are complex or large. As an alternative 
option, the extent of the structures can be interpolated. The drawback of this method 
is that images have to be 3D reconstructed before further processing [16]. Herein, we 
provide a use case using a semi-automated prediction of borders supported by deep 
learning. The algorithm learns from manually annotated borders and is able to use this 
knowledge for annotating the same area in previously unseen sections. Since it directly 
interprets the texture and brain topology, it is much more precise than a 3D interpolation 
in the reconstructed space and allows to use unregistered single sections. 

Due to the large number of images and large size of the sections, training and appli- 
cation of the CNN was performed on the supercomputer system JURECA [18] at Jülich 
Supercomputing Centre (JSC). The use of high-performance computers becomes even 
more relevant when larger structures are being analyzed in whole brain sections. While 
computation in the LGB was completed in 2 h, comparable computations for the thala- 
mus, for example, would take 50 h. Existing work on automatic classification of cytoar- 
chitectonic cortical areas [16] also indicate that computational effort increases consid- 
erably when trying to automatically identify larger brain regions with more complex 
cytoarchitectonic and morphological properties. 

Our findings of the total volume of the LGB, shrinkage factor included, is in line with 
previous findings. Andrews and colleagues reported mean LGB volumes of 121 mm? for 
the right and 115 mm? for the left hemisphere; the variance was quite high and ranged 
from 91.1 to 157 mm? for both hemispheres [25]. Further investigations in additional 
brains in the future would help to better understand intersubject variability in terms of 
the size and/or shape of the LGB. 

High-resolution mapping data of the LGB may open a broad field of applications. 
For example, routine Magnetic Resonance Imaging (MRI) studies often lack sufficient 
contrast and/or spatial resolution and could benefit from such atlas data. Current studies 
on the implementation of electrically stimulated prostheses in the visual cortex aim 
to restore part of the vision in blind people by multiple stimulations of electrodes to 
percept light [26, 27], where such maps could be applied in the future to increase the 
localization accuracy. The investigation of the pathomechanisms of diseases where the 
visual pathway is affected, such as Multiple Sclerosis or glaucoma could be supported by 
the maps [28, 29]. The maps provide input data for modelling and simulation of different 
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types of neurons, neuronal pathways or networks using platforms like The Virtual Brain 
[30]. With respect to basic neuroscience, the visualization and differentiation of the 
layers is expected to contribute to a more in-depth analysis of information processing in 
visual pathways. The present approach provides a use case for application in other brain 
areas and brains of other species enabling a fast and detailed prediction of the extent 
of small and complex structures, including visualization and volume analyses, with a 
minimum of manual effort and time expenditure. 
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Abstract. Nowadays, clinicians have multiple tools that they can use 
to stimulate the brain, by means of electric or magnetic fields that can 
interfere with the bio-electrical behaviour of neurons. However, it is still 
unclear which are the neural mechanisms that are involved and how the 
external stimulation changes the neural responses at network-level. In 
this paper, we have exploited the simulations carried out using a spiking 
neural network model, which reconstructed the cerebellar system, to shed 
light on the underlying mechanisms of cerebellar Transcranial Magnetic 
Stimulation affecting specific task behaviour. Namely, two computational 
studies have been merged and compared. The two studies employed a 
very similar experimental protocol: a first session of Pavlovian associa- 
tive conditioning, the administration of the TMS (effective or sham), a 
washout period, and a second session of Pavlovian associative condition- 
ing. In one study, the washout period between the two sessions was long 
(1 week), while the other study foresaw a very short washout (15 min). 
Computational models suggested a mechanistic explanation for the TMS 
effect on the cerebellum. In this work, we have found that the duration of 
the washout strongly changes the modification of plasticity mechanisms 
in the cerebellar network, then reflected in the learning behaviour. 


Keywords: Brain simulation - Cerebellum - Spiking Neural 
Networks - TMS - Neurostimulation 


1 Cerebellar Transcranial Magnetic Stimulation 


Transcranial magnetic stimulation (TMS) is a noninvasive technique that can be 
used to study, diagnose, or treat neural pathologies. A coil induces a magnetic 
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field that generates an electric field in the brain tissue. The electric field directly 
interferes with nervous system functions by changing the electrical behaviour of 
neurons. 

Among the different protocols that are available for TMS, continuous theta- 
burst stimulation is usually delivered to influence long-term plasticity changes. 
The stimulation protocol can consist of pulse bursts 50 Hz repeated every 200 ms, 
given in a continuous train lasting tens of seconds. The stimulation intensity is 
calibrated using the active motor threshold, defined as the lowest intensity stably 
evoking motor-evoked potentials. Common values for the stimulation amplitudes 
for theta-burst TMS range between 0.5 and 0.7 kV, generating a peak magnetic 
field of ~1 T reaching a depth of 20-30 mm from the scalp surface [13]. 

Cerebellar TMS foresees the administration of the stimulation over one cere- 
bellar hemisphere or the cerebellar vermis, and it can be used during cerebellar- 
driven protocols to interfere with the learning processes at neural level. It has 
been shown that cerebellar TMS stimulation influences the learning processes, 
but the underlying mechanisms are still unconfirmed [7, 10, 12, 16]. 

In order to better understand the effects of cerebellar TMS, both experimen- 
tal and computational approaches have been used in the last years. 


2 Experimental Protocols 


Monaco and colleagues [13, 14] have employed TMS stimulation on human partic- 
ipants between two sessions of eyeblink conditioning protocol (EBC), a temporal 
associative task in which the subject learns, thanks to cerebellar plasticity, the 
precise timing associations between two stimuli. In EBC, the participant learns 
the association between a neutral conditioned stimulus (e.g., a sound) and a 
following unconditioned stimulus, eliciting an eyeblink (e.g., an electric shock 
near the eye). Initially, subjects respond with a reflexive eyelid closure, following 
the unconditioned stimulus. Along with learning of this association, participants 
start to express a Conditioned Response (CR), anticipating the unconditioned 
stimulus. 

The two EBC studies [13,14] have common features, such as the presence of 
two consecutive sessions of EBC (session; and sessiong), each one composed 
of 6 blocks of acquisition, where two stimuli were provided to the subject, and 
one block of extinction, where only one stimulus was provided to the subject. In 
the acquisition phase, the subject learns the timing association between the two 
stimuli, thus exhibiting an increasing percentage of correct CRs. In the extinction 
phase, the subject unlearns the association between the two stimuli, since the 
second one no more follows the first one. In both studies, an effective or a sham 
cerebellar T'MS stimulation was administered at the end of the first session. The 
studies aimed at investigating the behavioural differences between the TMS and 
the control (sham) groups in the second session. 

There are some minor discrepancies between the two experimental protocols 
(see Table 1), but the most important one is the washout period between the 
first and the second session of EBC: a long one (i.e., 1 week) in Monaco et al. 
2014 [13], and a short one (i.e., 15 min) in Monaco et al. 2018 [14]. 
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Table 1. Details of the two EBC experimental protocols 


Property Monaco et al. 2014 | Monaco et al. 2018 
Number of subjects for each group | 11 12 

Number of groups 2 3 

Stimulated hemisphere Right Right or left 
Number of trials per block 10 11 

Inter-stimulus interval 600 ms 400 ms 

Washout period 1 week 15 min 
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In both experimental datasets, changes in the learning and unlearning trajec- 
tories of CRs between the TMS and the control groups were observed. Namely, 
with a long washout, the TMS group showed impairment in the extinction phase 
of sessions (from block 6 to block 7), where the unlearning resulted in being 
slowed down (Fig. 1.A). With a short washout, the TMS groups (right or left 
stimulated hemisphere) showed a less effective relearning phase at the beginning 
of acquisition (block 1 of sessiona) and, again, a slowed down unlearning during 
the extinction phase (Fig. 1.B). 

Overall, the findings from both experimental studies suggested that TMS can 
impair memory consolidation processes in the cerebellum, possibly by interfering 
with memory transfer from the cerebellar cortex to deeper structures. However, 
due to the noninvasive nature of TMS investigation, it was possible to only 
speculate about the putative mechanisms underlying the behavioural differences. 
To have a greater insight about the neural processes involved by the TMS, a 
computational approach was used in two previous works [1,3]. 


3 Computational Modelling 


Historically, computational models of the brain have been widely used to acquire 
new knowledge that cannot be obtained from physiological studies and abstract 
theories. These models proved to be a powerful tool that can support the other 
approaches in tackling the complex problem of understanding how the brain 
works. Spiking Neural Networks (SNNs) have been used to mimic the neural 
organization, employing single units (i.e., the neurons) organized and connected 
similarly to the relative biological structures [4,9]. 

Recently, a detailed spiking neural network model of the cerebellar microcir- 
cuit proved able to reproduce multiple cerebellar-driven tasks [5], among which 
the EBC paradigm [2]. Having been validated, the SNN model was challenged 
to fit the two experimental datasets recorded by Monaco and colleagues from 
human subjects [13,14], with a long [1] or a short washout [3]. In both studies, 
the SNN model was able to capture the specific alterations in the second EBC 
session caused by TMS interference. Indeed, TMS affected motor response evolu- 
tion along task repetitions, and we inferred the underpinning plasticity changes 
over the whole network. 
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Fig. 1. CR percentages in experimental and modelling studies. A) Box-plots of CR 
percentages along the seven blocks of the EBC protocol with a long washout (1 week) 
between session; and sessions. Left panel: session;, Central panel: sessiona with 
sham TMS, Right panel: sessionz with effective TMS. Boxes represent the upper and 
lower quartiles, the whiskers identify the range, and the red lines represent median 
values. Black boxes represent experimental data from [13] and blue boxes represent 
modelling data from [1] B) Box-plots of CR percentages along the seven blocks of the 
EBC protocol with a short washout (15min) between sessioni and session2. Grey 
boxes represent experimental data from [14], and white boxes represent modelling data 
from [3]. (Color figure online) 


The SNN cerebellar microcircuit was populated with leaky Integrate&Fire 
neurons, distinguishing between different neural groups. Mossy Fibers (MFs), 
the input neurons of the system, encode the first (conditioned) stimulus. In fact, 
it has been shown that these neurons encode the state of the system (e.g., the 
presence of a certain sound). Granular Cells (GrCs) represent in a sparse way 
the input from the MFs. Inferior Olive neurons (IOs), the other input to the sys- 
tem, encode the second (unconditioned) stimulus, since this neural population 
is active in presence of pain. Purkinje Cells (PCs) integrate the sparse infor- 
mation coming from the GrCs through the Parallel Fibers (PFs), while Deep 
Cerebellar Nuclei (DCNs), the only output of the cerebellar microcomplex, acti- 
vate the motor response (i.e., the anticipatory CR). While the network structure 
and connectivity are the same in the two computational studies (Fig. 2), the 
number of neurons for each population is different, as reported in Table 2, since 
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the SNN used in Antonietti et al. 2018 is three-times larger than the one used 
in Antonietti et al. 2016. Both studies used EDLUT simulator environment to 
perform the neural simulations [17]. 

The connectivity between the different neural populations follows the same 
rules. MFs send projections to the GrCs, each GrC receives input from 4 MFs; 
IOs send one-to-one teaching connections to PCs; DCNs receive both excitatory 
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Fig. 2. Spiking Neural Network model used to simulate the cerebellar circuit. Both 
computational studies [1,3] used the same network architecture, but with a different 
number of neurons for each neural population (see Table 2). Circles represent neurons 
and lines represent synaptic connections. Plasticity sites are marked by orange labels. 
The first (conditioned) stimulus is encoded by MFs, while the second (unconditioned) 
is encoded by IOs. The activity of DCNs is the network output and generates CRs. 
(Color figure online) 


Table 2. Number of neurons and synapses of the SNNs cerebellar models 


Neural population | Antonietti et al. 2016 | Antonietti et al. 2018 
MFs 100 300 
GrCs 2000 6000 
IOs 12 36 
PCs 12 36 
DCNs 6 18 
Synaptic connection 

MF-GrC (static) 8000 24000 
PF-PC (plastic) 19164 172806 
IO-PC (static) 12 36 
MF-DON (plastic) 600 5400 
PC-DCN (plastic) 12 36 
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inputs directly from MFs and inhibitory synapses from PCs. The SNN models 
has three plasticity sites, at cortical level (PF-PC) and at nuclear level, between 
MF-DCN and PC-DCN, all based on different kinds of Spike- Timing Dependent 
Plasticity (STDP) [6,11]. PF-PC plasticity is modulated by IO activity, MF- 
DCN by PC activity, while PC-DCN is a standard unsupervised STDP learn- 
ing, depending only on the difference between the pre- and post-synaptic firing 
times [8, 15,18]. Each learning rule encompasses two different plasticity mech- 
anisms: Long Term Depression (LTD), decreasing the synapse strength, and 
Long Term Potentiation (LTP), strengthening the connection. Therefore, each 
plasticity site can be characterized by two constants that regulate the amount 
of synaptic change. These constants cannot be directly computed from physio- 
logical data; as a result, we have treated those as free parameters of the SNN 
model, to be optimized according to the desired behaviour. The two computa- 
tional studies employed evolutionary algorithms to identify the best six parame- 
ters (PF-PC LTP, PF-PC LTD, MF-DCN LTP, MF-DCN LTD, PC-DCN LTP, 
and PC-DCN LTD) in each experimental condition (sessioni, sessionasnam, and 
sessionorms). 

Employing realistic SNN models, Antonietti and colleagues have shown how 
closed-loop simulations can be successfully used to fit real experimental datasets. 
Thus, the changes in the model parameters in the different sessions of the pro- 
tocol unveil how microcircuit mechanisms let implicitly emerge healthy and 
altered behavioural functions. In this work, we have analyzed the data gen- 
erated by experimental and computational studies, in order to clarify the role of 
the washout period, the main difference between the two datasets. 


4 Comparative Analysis 


The elements analyzed in the present study are the learning and unlearning 
trajectories (i.e., the variation in CR percentage between blocks) and the values 
of LTP and LTD parameters at the three plasticity sites that yielded to different 
behaviours in the simulations. 

The two computational studies have already demonstrated that the behav- 
ioural response generated by the model in each block was a good representation 
of the experimental recordings. Figure 1 shows that the CRs generated by the 
model (blue boxes in Panel A, white boxes in Panel B) were comparable with 
the experimental results (black boxes in Panel A, grey boxes in Panel B). The 
degree of variability of the experimental data was higher than the computational 
studies, but a certain degree of variability was maintained. The variability in the 
results generated by SNN models was due to the fact that multiple combinations 
of LTP and LTD parameters have been considered. In fact, the evolutionary algo- 
rithm identified a family of optimal parameter combinations, leading to similar 
performances. 

For the short washout protocol, two distinct TMS groups were identified, one 
receiving a stimulation of the left cerebellar hemisphere, the other one receiving 
the stimulation on the right hemisphere. Since there were no significant differ- 
ences between the CRs recorded from the two groups, in the present study, the 


Computational Modelling of Cerebellar TMS: The Effect of Washout Al 


results of both groups have been merged in a single TMS group, then compared 
to the sham group. In this way, we can make a direct comparison between the 
two studies, considering only two groups: sham vs TMS. 

We focused our analysis on the two salient phases that were changed in 
sessions for the TMS group with respect to the sham group. Namely, the fast 
acquisition (from zero to block 1, Fig. 3.A) and the extinction (from block 6 to 
block 7, Fig. 3.B). After a prolonged washout, the percentage of CRs acquired 
in the first block did not differ between the sham and the TMS groups. On the 
other hand, TMS slowed down the fast learning phase when a short washout 
interleaved the two EBC sessions (73% sham, 54.5% TMS, Fig. 3.A). 

Conversely, the TMS administration interfered with the extinction phase 
after both long and short washouts. In fact, with a long washout the TMS group 
decreased the percentage of CRs of only —30% [—40 —10], while the sham group 
unlearnt faster —50% [—60 —40]. The same behaviour was observed with a short 
washout, where the extinction rate was —36% [—55 —27] in TMS group and 
—55% [-64 —45] in the sham group. 
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Fig. 3. Comparison of learning and unlearning rates in sessioni, session2snam, and 
session2r Ms. A) Fast learning rates, i.e., increase in CRs in the first acquisition block 
with long (left) or short (right) washouts. B) Unlearn rates, i.e. decrease in CRs from 
the last block of extinction (block 6) to the extinction (block 7). Boxes represent the 
upper and lower quartiles, the whiskers identify the range, and the red lines represent 
median values. Blue boxes represent modelling data from [1] and white boxes represent 
modelling data from [3]. (Color figure online) 


Summarizing, the TMS affected both the fast acquisition and the extinction 
only with a short washout, while it impacted only the extinction phase with a 
long washout. We have then analyzed the LTP and LTD parameters of the SNN 
distributed plasticity that generated this different behaviour. 

Figure 4 illustrates the overall variations of LTP and LTD parameters for 
the cortical plasticity (PF-PC, Panel A) and the nuclear plasticities (MF-DCN, 
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Fig. 4. Changes for LTP and LTD Parameter at the Three Different Plasticity Sites: 
A) PF-PC, B) MF-DCN, and C) PC-DCN. The left column represents box-plots of 
the parameter change in sessiona for the TMS group with respect to the parameter 
values for the sham group. Boxes represent the upper and lower quartiles, the whiskers 
identify the range, and the red lines represent median values. Blue boxes represent 
modelling data from [1] and white boxes represent modelling data from [3]. The right 
column presents a synthesis of the parameter changes (LTP and LTD) and the overall 
effect on the CR generation. Plus/minus symbols indicate qualitatively the amount of 
increase/decrease of the parameter in the TMS group or an increase/decrease in CR 
percentage generation. Equal symbols indicate negligible changes. (Color figure online) 
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Panel B, and PC-DCN, Panel C). The left column reports the percentage vari- 
ation of the LTP and LTD parameters in the TMS group with respect to the 
sham group. An increased LTP parameter implicates a stronger potentiation of 
the synapses, while an increased LTD parameter entails a stronger weakening. 
The right column summarizes the combined effect of LTP and LTD changes and 
their effect on the CR generation process, that can be favoured or hindered. 

Considering the PF-PC plasticity, it is possible to observe that while the 
LTD constant was not influenced by the TMS stimulation, the LTP parameter 
was decreased with a long washout and increased with a short one. This cause 
an increased and a decreased generation of CRs in the long and short washout 
case, respectively. 

Considering the nuclear plasticities, the changes were limited for both LTP 
and LTD parameters in the long washout case. On the other hand, the changes 
were evident for the short washout case, where the variations of LTP and LTD 
parameters moved in favour of a generation of CRs. In fact, the LTD of the 
excitatory MF-DCN synapses was decreased, while the LTD of the inhibitory 
PC-DCN synapses was strongly increased. As a result, more excitatory inputs 
from the MFs and a weaker inhibition from the PCs increased the firing rate of 
DCN, thus generating more CRs. 


5 Discussion and Conclusions 


The analysis of LTP and LTD parameters showed that changes in behaviour with 
short and long washout are due to different rate in the rules driving synaptic 
modifications. 

In particular, it can be observed that the administration of TMS followed by 
a prolonged washout caused a significant modification of the cortical plasticity 
only, with minor involvement of nuclear plasticities. Besides, the decrease of 
the PF-PC LTP parameter was in favour of a generation of CRs, therefore the 
fast acquisition in sessionorMs was not impaired since the SNN expressed a 
high number of CRs, but this change impacted the extinction phase, where 
suppression of CRs was required. 

In the case of a short washout, the parameters changed in a completely differ- 
ent way. First of all, it has been observed an involvement of both the cortical and 
the nuclear sites. While the cortical plasticity expressed a higher value of LTP, 
thus hindering CRs generation, the LTP and LTD constants of nuclear plastici- 
ties moved toward values that promoted CRs. As a matter of fact, the cortical 
and the nuclear mechanisms worked in opposite directions. This is the reason 
why, in the short washout case, both the learning and unlearning phases were 
impaired. The learning phase was slowed down by a weaker PF-PC response, 
while the reduced extinction was due to a higher DCN activity caused by the 
nuclear plasticities. 

We can, therefore, conclude that the duration of the washout after the TMS 
administration is a crucial variable that can change the reorganization of plas- 
ticity and neural dynamics in the cerebellum. Since the TMS induces an elec- 
trical field in the most superficial areas of the tissue, the cortical plasticity is 
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the one primarily involved, as reflected by major changes concerning its LTP 
mechanisms (increasing or decreasing the potentiation effectiveness). However, 
if sufficient recovery time is granted, the TMS effect is limited to the cerebellar 
cortex and does not interfere with deeper systems (i.e., nuclear plasticities) and 
memory transfer. Viceversa, if the inter-session pause after TMS perturbation 
is shortened, the cortical impairment in the acquisition phase triggers a com- 
pensatory effect of the nuclear plasticities, that try to favour CRs generation, 
but on a longer time-scale. T'hen, the nuclear compensation becomes an obstacle 
during the extinction phase. 

It is important to highlight that the effect of TMS is more evident for the 
short washout, both in the experimental and computational studies. In fact, 
SeSSiOn2snam and sessionopyMgs experimental data for the long washout proto- 
col show high variability during the acquisition blocks. At the same time, the 
computational model fit those blocks with less fidelity. The second EBC ses- 
sion after a long washout suffers from the interference of the neural activity and 
synaptic changes that naturally happen during one week of participants' life, 
where external stimuli and internal processes can disrupt or modify cerebellar 
memory formation and consolidation. 

This work carried out a retrospective analysis and a comparison of TMS- 
perturbed EBC paradigms that present some differences both from the experi- 
mental protocol (see Sect. 2, Table 1) and the related computational studies (see 
Sect.3, Table 2). However, we believe that this comparative analysis provides 
a summary of the mechanistic explanations that can be derived from the inter- 
pretation of the SNN model parameters, highlighting the effects of the washout 
period in studies that foresee the administration of (cerebellar) TMS on human 
participants. 


Data Availability. Datasets and Codes to reproduce the findings and fig- 
ures reported in this paper is publicly available at Harvard Dataverse (DOI: 
10.7910/DVN/9HPEV4). 
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Abstract. We are entering an age of ‘big’ computational neuroscience, 
in which neural network models are increasing in size and in numbers 
of underlying data sets. Consolidating the zoo of models into large-scale 
models simultaneously consistent with a wide range of data is only pos- 
sible through the effort of large teams, which can be spread across mul- 
tiple research institutions. To ensure that computational neuroscientists 
can build on each other's work, it is important to make models pub- 
licly available as well-documented code. This chapter describes such an 
open-source model, which relates the connectivity structure of all vision- 
related cortical areas of the macaque monkey with their resting-state 
dynamics. We give a brief overview of how to use the executable model 
specification, which employs NEST as simulation engine, and show its 
runtime scaling. The solutions found serve as an example for organizing 
the workflow of future models from the raw experimental data to the 
visualization of the results, expose the challenges, and give guidance for 
the construction of an ICT infrastructure for neuroscience. 


Keywords: Computational neuroscience * Spiking neural networks - 
Primate cortex - Simulations - Strong scaling - Reproducibility - 
Reusability - Complexity barrier 


1 Introduction 


With the availability of ever more powerful supercomputers, simulation codes 
that can efficiently make use of these resources (e.g., [1]), and large, system- 
atized data sets on brain architecture, connectivity, neuron properties, genetics, 
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transcriptomics, and receptor densities [2-10], the time is ripe for creating large- 
scale models of brain circuitry and dynamics. 

We recently published a spiking model of all vision-related areas of macaque 
cortex, relating the network structure to the multi-scale resting-state activity 
[11,12]. The model simultaneously accounts for the parallel spiking activity of 
populations of neurons and for the functional connectivity as measured with 
resting-state functional magnetic resonance imaging (fMRI). As a spiking net- 
work model with the full density of neurons and synapses in each local micro- 
circuit, yet covering a large part of the cerebral cortex, it is unique in bringing 
together realistic microscopic and macroscopic activity. 

Rather than as a finished product, the model is intended as a platform for 
further investigations and developments, for instance to study the origin of oscil- 
lations [13], to add function [14], or to go toward models of human cortex [15]. 

'To support reuse and further developments by others we have made the entire 
executable workflow available, from anatomical data to analysis and visualiza- 
tion. Here we provide a brief summary of the model, followed by an overview 
over the workflow components, along with a few typical usage examples. 

The model is implemented in NEST [16] and can be executed using a high- 
performance compute (HPC) cluster or supercomputer. We provide correspond- 
ing strong scaling results to give an indication of the necessary resources and 
optimal parallelization. 


2 Overview of the Multi-Area Model 


'The multi-area model describes all 32 vision-related areas in one hemisphere of 
macaque cortex in the parcellation of Felleman and Van Essen [17]. Each area 
is represented by a layered spiking network model of a 1 mm? microcircuit [18], 
adjusted to the area- and layer-specific numbers of neurons and laminar thick- 
nesses. Layers 2/3, 4, 5, and 6 each have an excitatory and an inhibitory population 
of integrate-and-fire neurons. To minimize downscaling distortions [19], the local 
circuits contain the full density of neurons and synapses. This brings the total size 
of the network to ~4 million neurons and ~24 billion synapses. All neurons receive 
an independent Poisson drive to represent the non-modeled parts of the brain. 

'The inter-area connectivity is based on axonal tracing data from the CoCo- 
Mac database on the existence and laminar patterns of connections [4], along 
with quantitative tracing data also indicating the numbers of source neurons in 
each area and their supragranular or infragranular location [22,23]. These data 
are complemented with statistical predictions (‘predictive connectomics’) to fill 
in the missing values, based on cortical architecture (neuron densities, laminar 
thicknesses) and inter-area distances [24]. Figurel shows the resulting connec- 
tivity at the level of areas, layers, and populations. A semi-analytical mean-field 
method adjusts the data-based connectivity slightly in order to bring the firing 
rates into biologically plausible ranges [25]. 

By increasing the synaptic strengths of the inter-area connections, slow activ- 
ity fluctuations, present in experimental recordings but not in the isolated micro- 
circuit, are reproduced. In particular, the system needs to be poised right below 
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Fig. 1. Overview of the connectivity of the multi-area model as determined from anatom- 
ical data and predictive connectomics. A Area-level connectivity. The weighted and 
directed graph of the number of outgoing synapses per neuron (out-degrees) between 
each pair of areas is clustered using the map equation method [20]. Green, early visual 
areas; dark blue, ventral stream areas; purple, frontal areas; red, dorsal stream areas; light 
red, superior temporal polysensory areas; light blue, mixed cluster. Black arrows show 
within-cluster connections, gray arrows between-cluster connections. B Population-level 
connection probabilities (the probability of at least one synapse between a pair of neu- 
rons from the given populations). C Hierarchically resolved average laminar patterns of 
the numbers of incoming synapses per neuron (in-degrees). Hierarchical relationships are 
defined based on fractions of supragranular labeled neurons (SLN) from retrograde trac- 
ing experiments [21]: feedforward, SLN > 0.65; lateral, 0.35 < SLN < 0.65; feedback, 
SLN « 0.35. The connections emanate from excitatory neurons and are sent to both 
excitatory and inhibitory neurons. For further details see [11]. 
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an instability between a low-activity and a high-activity state in order to capture 
the experimental observations. The spectrum of the fluctuations and the distribu- 
tion of single-neuron spike rates in primary visual cortex (V1) are close to those 
in lightly anesthetized macaque monkeys. At the same synaptic strengths where 
the parallel spiking activity of V1 neurons is most realistic, also the inter-area 
functional connectivity is most similar to macaque fMRI resting-state functional 
connectivity. 


3 The Multi-Area Model Workflow 


The multi-area model code is available via https:/ /inm-6.github.io/multi-area- 
model/ and covers the full digitized workflow from the raw experimental data to 
simulation, analysis, and visualization. The model can thus be cloned to obtain 
a local version, or forked to build on top of it. The implementation language 
is Python, the open-source scripting language the field of computational neuro- 
science has agreed on [26]. The online documentation provides all information 
needed to instantiate and run the model. The tool Snakemake [27] is used to 
specify the interdependencies between all the scripts and execute them in the 
right order to reproduce the figures of the papers on the model's anatomy [11], 
dynamics [12], and stabilization based on mean-field theory [25] (see Fig. 2). 
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Fig. 2. Visualization of a Snakemake workflow. The interdependencies between 
the scripts reproducing the figures in [12], visualized as a directed acyclic graph. The 
label of a node corresponds to the name of the script. 


Furthermore, if one of the files in the workflow is adjusted, Snakemake enables 
executing only that file and the ones that depend on it anew. A tutorial video 
(https://www.youtube.com/watch?v=YsH3BcyZBcU) gives a brief overview of 
the model, explains the structure of the code, and shows how to run a basic 
simulation. 
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4 Example Usage 


One main property delivered by the multi-area model is the population-, layer-, 
and area-specific connectivity for all vision-related areas in one hemisphere of 
macaque cortex. We here describe how to obtain the two available versions 
of this connectivity: 1) based directly on the anatomical data plus predic- 
tive connectomics that fills in the missing values; 2) after slight adjustments 
in the connectivity in order to arrive at plausible firing rates in the model. 
We also refer to this procedure as ‘stabilization’ because obtaining plausible 
activity entails enhancing the size of the basin of attraction of the low-activity 
state, i.e., increasing its global stability. An example how to use the mean-field 
method developed in [25] for this purpose is provided in the model repository: 
figures /SchueckerSchmidt2017/stabilization.py. The method adjusts the number 
of incoming connections per neuron (in-degree). The script exports the adjusted 
matrix of all in-degrees as a NumPy [28] file; to have the matrix properly anno- 
tated one can instantiate the MultiAreaModel class with the exported matrix 
specified in the connection parameters as K_stable. Afterwards, one can access 
the connectivity using the instantiation M of the MultiAreaModel class: M.K for 
the in-degrees or M.synapses for the total number of synapses between each pair 
of populations. To obtain the connectivity without stabilization, it is sufficient 
to instantiate the MultiAreaModel class without specifying K. stable. 

Performing a simulation of the full multi-area model requires a significant 
amount of compute resources. To allow for smaller simulations, it is possi- 
ble to simulate only a subset of the areas. In this case, the non-simulated 
areas can be replaced by Poisson processes with a specified rate. To this end, 
the options replace. non. simulated, areas and replace, cc. input, source in 
connection, params have to be set to *het, poisson, stat? and the path to a 
JSON file containing the rates of the non-simulated areas. Lastly, the simulated 
areas have to be specified as a list, for instance 


sim params['areas simulated?] = [‘V1’, 'V2?], 


before the MultiAreaModel class is instantiated. A simple example how to deploy 
a simulation is given in run example fullscale.py; the effect of replacing areas by 
Poisson processes is shown in Fig. 3. 


5 Strong Scaling 


The limiting factor dictating the necessary compute resources for simulating the 
multi-area model is the available memory. Approximately 1 TB is needed for 
instantiating the network alone. To ensure sufficient memory for the model, the 
simulation has to be distributed across multiple nodes. 

We simulated the model using NEST 2.14 [29] on the JURECA supercom- 
puter in Jülich, which provides 1872 compute nodes equipped with two Intel 
Xeon E5-2680 v3 Haswell CPUs per node. Each CPU has 12 cores running at 
2.5 GHz. The standard compute node provides 128 GB of memory of which 96 
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Fig. 3. Simulating subsets of the multi-area model. Bottom panels, spiking activ- 
ity of excitatory (blue) and inhibitory (red) neurons in all layers of area V1 where in 
A and B a subset of areas is replaced by Poisson processes. Top panels, sketches visu- 
alizing which areas were simulated (white spotlights); the colors therein correspond 
to different clusters: lower visual (green), ventral stream (dark blue), dorsal stream 
(red), superior temporal polysensory (light red), mixed cluster (light blue), and frontal 
(purple). If all areas besides V1 are replaced by Poisson processes (A), the activity 
displays no temporal structure. Simulating a subset of ten areas (B) slightly increases 
the activity but does not give rise to a temporal structure, either. Only the simulation 
of the full model (C) produces a clear temporal structure in the activity. Parameters 
identical to [12, Fig. 5]. 


A B 
24 180 
Total —- — State Propagation 
20 Ew Network Construction E NEW Delivery 
E State Propagation 7 ENS Update 
21 fa EE Communication 
5 - 
E E 
P S 
o -— 
E o 
= E 
5 
© 
o9 
æ 


108 132 


84 108 132 156 180 
Nodes Nodes 


Fig. 4. Strong scaling of the multi-area model. The contributions of different 
phases to the total simulation time and state propagation for 12 to 200 compute nodes 
for 1s of biological time. Each data point consists of three random network realiza- 
tions, with error bar showing the standard deviation of measurements. A The main 
contributions to the total time are network construction and state propagation. B The 
state propagation is dominated by three phases: the communication, update, and spike 
delivery phase. Adding more compute resources while keeping network size the same 
(strong scaling) decreases the latter two but increases the absolute and relative con- 
tribution of the communication phase. The combined contributions are minimized at 
a real-time factor of 31 (black horizontal line). Parameters identical to [12, Fig. 5]. 
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GB are guaranteed to be available for the running application. Thus, on this 
machine, the multi-area model needs at least 11 nodes. 

Having established the minimal hardware requirements, we are interested in 
the runtime of the simulation depending on the compute resources. We quantify 
the proximity of the time required for state propagation to biological real time 
by the real-time factor Tsim/Tbio. Carrying out a strong scaling experiment, the 
problem size stays fixed and we increase the number of compute nodes from 
12 to 200, thus reducing the load per node. In our simulations we use 6 MPI 
processes per node and 4 threads per MPI process, as we found this hybrid 
parallelization to perform better than other combinations of threading and MPI 
parallelism (data not shown). In particular during the construction phase, hybrid 
parallelization outperformed pure threading by a large margin. The threads are 
pinned to the cores and jemalloc is used as a memory allocator ([30], see [31] 
for the relevance of the allocator for NEST). In each run, we simulate 10s of 
biological time. 

In Fig. 4A the total runtime and its main contributions, network construc- 
tion and state propagation, are shown. The contribution of state propagation is 
averaged to 1s of biological time. The main share of the time is taken up by 
network construction. During this phase the neurons and synapses are created 
and connected, while during state propagation the dynamics of the model is sim- 
ulated. The time spent in the former phase is fixed, as it is independent of the 
specified biological time, whereas the time spent propagating the state depends 
on the specified biological time and the state of the network. Depending on the 
initial conditions and the random Poisson input, the network exhibits higher or 
lower activity, affecting the time spent propagating the state. Hence, the ratio of 
both phases should not be taken at face value. In some cases, longer simulations 
are of interest, increasing the relevance of the time spent propagating the state. 
Thus, it is interesting to know how different components of the state propagation 
algorithm contribute to this phase. 

The three main phases during state propagation are: update of neuronal 
states, communication between MPI processes, and delivery of spikes to the tar- 
get neurons within each target MPI process. Figure 4B shows the contributions 
of these phases to the real-time factor. Adding more compute resources brings 
down the contributions of the update and delivery phases and increases the 
time consumption of communication. Especially the delivery of spikes is heav- 
ily dependent on the network activity. At 160 nodes, a real-time factor of 31 is 
achieved (mean spike rate 14.1spikes/s). This slowdown compared to real time 
enables researchers to study the dynamics of the multi-area model over suffi- 
ciently long periods, for example in detailed correlation analyses, but systematic 
investigations of plasticity and learning would still profit from further progress. 

In order to test the influence of the communication rate on the time required 
for state propagation, we carried out a simulation of a two-population balanced 
random network [32] which has been used in previous publications on neuronal 
network simulation technology [1,31,33-36]. We use the same parameters as in 
[1], but replace connections governed by spike-timing dependent plasticity by 
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static connections. In addition we set the numbers of neurons and synapses to 
match those in the multi-area model (resulting mean spike rate 12.9 spikes/s). 
The communication interval is determined by the minimum delay, as the spikes 
can be buffered over the duration of this delay while maintaining causality [37]. 
The multi-area model has a minimum delay of 0.1 ms, whereas the balanced 
random network has a uniform delay of 1.5 ms, so that communication occurs 15 
times less often. Using 160 nodes and the same configurations as before, we find 
a real-time factor of 17. Here, 80% of the time is spent delivering the spikes to 
the target neurons locally on each process, whereas only 1% of the time is spent 
on MPI communication. Forcing the two-population model to communicate in 
0.1ms intervals by adding a synapse of corresponding delay and zero weight 
indeed requires the same absolute time for MPI communication as in the multi- 
area model. The real-time factor increases to 34, significantly larger than for the 
multi-area model. The increase is entirely due to a longer spike delivery phase. 
How the efficiency of spike delivery is determined by the network activity remains 
to be answered by future investigations. Possibly relevant factors are the wide 
distribution of spike rates in the multi-area model compared to the narrow one in 
the two-population model, and the different synchronization patterns of neuronal 
activity in the two models. In summary, less frequent MPI communication shifts 
the bottleneck to another software component while almost halving the total 
runtime. This opens up the possibility of speeding up the simulation even more 
through optimized algorithms for spike delivery on each target process. 


6 Conclusions 


The usefulness of large-scale data-driven brain models is often questioned [38— 
40], as their high complexity limits ready insights into mechanisms underlying 
their dynamics, large numbers of parameters and a lack of testing of models with 
new data may lead to overfitting and poor generalization, and function does not 
emerge magically by putting the microscopic building blocks together. However, 
this argument can also be turned around. It seems that in recent years the com- 
plexity of the majority of models and thereby their scope of explained brain 
functions is not increasing anymore. One reason is that elegant publications on 
minimal models explaining a single phenomenon are often also end points in 
that they have no explanatory power beyond their immediate scope. It remains 
unclear how the proposed mechanisms interact with other mechanisms realized 
in the same brain structure, and how such models can be used as building blocks 
for larger models giving a more complete picture of brain function. The powerful 
approach of minimal models from physics needs to be integrated with the sys- 
tems perspective of biology. To achieve models able to make accurate predictions 
for a broad range of questions, the zoo of available models of individual brain 
regions and hypothesized mechanisms needs to be consolidated into large-scale 
models tested on numerous benchmarks on multiple scales [41]. Having an accu- 
rate, if complex, model of the brain that generates reliable predictions enables in 
silico experiments, for instance to predict treatment outcomes for neurological 
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conditions, potentially even for individual subjects [42]. Furthermore, combin- 
ing the bottom-up, data-driven approach with a top-down, functional approach 
allows models to be equipped with information processing capabilities. Creating 
such accurate, integrative models will require overcoming the complexity barrier 
computational neuroscience is facing. Without progress in the software tools 
supporting collaborative model development and the expressive digital repre- 
sentation of models and the required workflows, reproducibility and reusability 
cannot be maintained for more complex models. 

On the technical side, simulation codes like NEST have matured to generic 
simulation engines for a wide range of models. Recent developments in the simu- 
lation technology of NEST have considerably sped up the state propagation and 
reduced the memory footprint [1] of large-scale network models. The rapid state 
propagation causes the network construction phase to take up a large fraction of 
the simulation time for simulations of short to medium duration. Furthermore, 
the fact that hybrid parallelization currently performs better than pure thread- 
ing during the construction phase indicates that the code still spends time on 
the Python interpreter level and does not yet optimally make use of memory 
locality. For these reasons, speeding up network construction should be a focus 
of future work. 

Our strong scaling results show that communication starts to dominate at 
an intermediate number of nodes, so that the further speed-up in the solution of 
the neuron equations cannot be fully exploited. Therefore, it would be desirable 
to develop methods for further limiting the time required for communication, 
for instance by distributing the neurons across the processes according to the 
modular structure of the neuronal network [43], as opposed to the current round- 
robin distribution. The longer delays between areas compared to within areas 
would then allow less frequent MPI communication, by buffering the spikes for 
the duration of the delay [37,43]. A major fraction of time is then spent in the 
spike delivery phase. Here an algorithm needs to transfer the spikes arriving at 
the compute node to their target neurons. It is our hope that in future a better 
understanding of the interplay between the intrinsically random access pattern 
and memory architecture will lead to more effective algorithms. 

While the publication of the model code in a public repository enables down- 
loading and executing the code, this requires setting up the simulation on the 
chosen HPC system, which may be nontrivial, and the HPC resources have to 
be available to the research group in the first place. Therefore, it would be 
desirable to link computing resources to the repository, enabling the code to be 
executed directly from it. The ICT infrastructure for neuroscience (EBRAINS) 
being created by the European Human Brain Project (HBP) has made first steps 
in this direction. A preliminary version of a digital workflow for the collaborative 
development of reproducible and reusable models was evaluated in [44]. Next to 
finding a concrete solution for the multi-area model at hand, the purpose of the 
present study was to extend the previous work and obtain a clearer picture of the 
requirements on collaborative model development and the digital representation 
of workflows. From the present perspective it seems effective not to reimplement 
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the functionality of advanced code development platforms like GitHub in the 
HBP infrastructure but to build a bridge enabling execution of the models and 
storage of the results. An essential feature will be that the model repository 
remains portable by abstractions from any machine specific instructions and 
authorization information. 

The microcircuit building block for this model [18] has found strong resonance 
in the computational neuroscience community, having already inspired multiple 
follow-up studies [45-52]. The multi-area model of monkey cortex developed by 
Schmidt et al. [11,12] and described here has a somewhat higher threshold for 
reuse, due to its greater complexity and specificity. Nevertheless, it has already 
been ported to a single GPU using connectivity generated on the fly each time a 
spike is triggered, thereby trading memory storage and retrieval for computation, 
which is possible in this case because the synapses are static [53]. We hope 
that the technologies presented here push the complexity barrier of neuroscience 
modeling a bit further out, such that the model will find a wide uptake and 
serve as a scaffold for generating an ever more complete and realistic picture of 
cortical structure, dynamics, and function. 
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Abstract. The precise simulation of the human brain requires coupling 
different models in order to cover the different physiological and func- 
tional aspects of this extremely complex organ. Each of this brain mod- 
els is implemented following specific mathematical and programming 
approaches, potentially leading to diverging computational behaviour 
and requirements. Such situation is the typical use case that can benefit 
from the Modular Supercomputing Architecture (MSA), which organizes 
heterogeneous computing resources at system level. This architecture 
and its corresponding software environment enable to run each part of 
an application or a workflow on the best suited hardware. 

This paper presents the MSA concept covering current hardware and 
software implementations, and describes how the neuroscientific work- 
flow resulting of coupling the codes NEST and Arbor is being prepared 
to exploit the MSA. 


Keywords: Modular Supercomputing Architecture - MSA - 
Heterogeneous computer architectures - DEEP projects * Accelerators - 
Workflow * Neuroscience * Arbor + NEST 


1 Introduction 


Since the construction of the first cluster computer in the nineties [1], intercon- 
necting a large number of commodity, general-purpose processors has become the 
most popular approach to build High-Performance Computing (HPC) systems. 
In recent years, these traditionally homogeneous clusters are being substituted 
by heterogeneous configurations employing a variety of acceleration technologies. 
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Computing devices are considered as accelerators when they have been 
designed to perform specific operations very fast. In principle, such definition 
would apply even to individual execution units within the CPU, such as tensor 
cores or advanced vector registers. However, in this paper we denote as accelera- 
tors only those out-of-package devices made of a very large number of relatively 
simple compute cores. Under this definition, the most frequently used example 
of an accelerator is a graphic processing unit (GPU). As their name says, GPUs 
were originally developed to very efficiently render and visualize graphics, but 
today their compute performance is also employed to perform floating point and 
tensor operations in all kinds of applications. Since accelerators are designed 
to execute auxiliary operations, they frequently depend on a CPU (considered 
as its host) to carry out important actions such as booting the accelerator and 
enabling it to communicate through the cluster-level high-speed network. 

Accelerators rely on exploiting parallelism to compute, as their large number 
of compute cores/units are operated at relatively low frequencies. In consequence, 
they are able to achieve high peak-performances using less power than standard 
CPUs. Their energy efficiency — expressed through a high Flop/Watt ratio — is in 
fact the main reason for the success of accelerators in HPC. Clusters with accel- 
erators are generally more energy-efficient than those without, and this difference 
becomes a major cost-factor in the operation of very large-scale platforms. 

With regards to the system-level architectures of accelerated clusters, one 
can observe that typically one, two or more accelerators are integrated inside 
a node connected to a general purpose CPU via a PCle interface. This node 
configuration is then multiplied several thousands of times, and the CPUs are 
interconnected with each other via some high-speed network. Recently, intercon- 
nection of the GPUs within the node has become possible, improving the ability 
to exchange data between them. In consequence the trend goes towards GPU- 
islands with four, six or even more GPUs per node. The negative consequence is 
a dramatic growth in compute power inside the accelerated-node, which is not 
compensated with a proportional increase in inter-node communication band- 
width. Therefore, though the scaling inside the node improves, the system-level 
scaling of codes is impeded by the imbalanced compute-to-communication capa- 
bilities between nodes. 

The traditional programming model for node-level accelerated clusters is to 
run the main program in the host CPUs and offload compute-intensive kernels to 
the accelerators. For the large problems tackled by HPC, multi-node executions 
require exchanging data between the parts of the application running on the host 
and the accelerator. However, the static assignment of accelerators to CPUs 
within the node, added to the additional communication latencies that apply 
when inter-accelerator communication is executed through the host — what today 
is not always necessarily the case anymore ~, limits the scalability and flexibility 
of this node-level heterogeneous cluster concept. 

For example: an application running on a cluster with four GPUs per node 
might fully exploit the host CPUs but use only one of the GPUs attached to 
each one of them. In such case it will be hardly possible for other applications 
to use these free devices, as access needs to go through the CPU, which is busy. 
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In consequence, in this example almost 3/4 of the computation power of the 
system will remain idle. Obviously such situation would be avoided if the system 
is deployed with exactly the amount of accelerators per node required by the 
applications that will run on it. However, finding the right static configuration 
for all the users of the HPC system becomes impossible since the application 
portfolio is getting more and more diverse. 

Today, the users of HPC systems employ codes ranging from high-scale, 
tightly coupled simulations, to high-performance data analytics (HPDA) and 
deep learning (DL). In fact, not only the applications are very different from 
each other, but even the workflows from individual users combine codes with 
very diverse requirements. 

This is particularly the case in neuroscience, which aims at better under- 
standing the behaviour of the possibly most complex organ in nature: the human 
brain. The huge scale spans (from nanometers to centimeters), the complexity of 
the involved physical and biological effects, and the tight interrelation between 
all of these aspects require the combination of various codes in order to reproduce 
the behaviour of the brain with some accuracy. All these codes present generally 
different requirements, making the overall usage scenario a natural candidate for 
using the a Modular Supercomputing Architecture (MSA). 

'The particular case addressed by this paper is the coupling of NEST and 
Arbor, two neuroscientific codes that can together bring a deep insight in the 
functions of the human brain. NEST simulations of large-scale networks of simple 
integrate-and-fire type model neurons are memory-bound due to the communica- 
tion and memory accesses required to reproduce the exchange of neuronal signals, 
which dominate the total runtime [23]. Therefore, NEST runs best on general 
purpose clusters. On the other hand, Arbor simulates multi-compartment neu- 
rons with a very high computational cost per neuron and is therefore compute- 
bound, making it the ideal candidate to run on accelerators. The coupling of both 
codes could therefore profit from an MSA system in which CPU and accelerator 
resources can be reserved and allocated independently. 

This paper explains the MSA and how a neuroscientific workflow combining 
the codes NEST and Arbor aims at employing it. Section 2 explains the architec- 
ture concept, while Sect. 3 and 4 describe current hardware platforms and their 
software environment. Section 5 describes the above-mentioned Nest-Arbor neu- 
roscientific workflow, and the paper is summarized in Sect. 6. 


2 'The Modular Supercomputing Architecture (MSA) 


The Modular Supercomputing Architecture (MSA) developed at the Jülich 
Supercomputing Centre (JSC) within the series of EU-funded DEEP projects [2, 
3] aims at providing cost-effective computing capabilities fitting the needs of a 
wide range of computational sciences [4,5]. 

The MSA segregates heterogeneous resources and implements heterogeneity 
at system level, instead of node level (see Fig. 1). In its simplest configuration (the 
so-called cluster-booster approach [6]), a cluster made of general purpose CPUs 
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Fig. 1. Sketch of the modular supercomputing architecture. Note that this diagram 
reflects the general concept and does not represent any specific computer. Example 
modules: Cluster (CN: Cluster node), Booster (BN: Booster node), Data Analytics 
(AN: Analytics node), Neuromorphic (NN: Neuromorphic node), and Quantum (QN: 
Quantum node). For a schema of the MSA realization in the DEEP-EST prototype see 
Fig. 3, left. 


is attached to a cluster of accelerators (the booster). In the latter accelerators are 
considered and operated as first-class computing devices. Furthermore, nodes on 
cluster and booster can be allocated independently and according to the needs 
of each application, so that no resources are blocked by allocating others. 

In the first hardware realizations of the cluster-booster concept (e.g. 
JURECA, see Sect. 3.1), the booster used many-core processors that could boot 
and communicate through the system-level network without relying on host- 
CPUs. Fully autonomous accelerators are ideal for the MSA, as they enable 
scaling-up the cluster and the booster independently. In particular, the energy- 
efficient booster can be built at very large size (e.g. exascale), while the cluster 
is kept at a relatively small size to cover the needs of low-scaling parts of the 
applications without impacting on the overall power consumption of the sys- 
tem. Note that this is not possible in the traditional accelerated-node approach, 
where increasing the number of accelerators implies a proportional increase in 
the amount of general-purpose CPUs due to the static assignment between both. 

Unfortunately, today's GPUs still rely on a host-CPU and cannot be operated 
autonomously. Still, the booster philosophy can be kept if one employs a low-power 
(and computationally weak) CPU, reducing its role to the orchestration and oper- 
ation of the attached GPU(s). In this case, even if the number of CPUs increases 
when scaling-up the system, their contribution to the overall power consumption 
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is very small. Another key goal of the booster is achieving a good intra-node and 
inter-node communication-to-computation balance. If the selected GPU is com- 
putationally very strong, it might be beneficial to keep a low number of them per 
node (eventually only one), in order to fully exploit all its bandwidth towards the 
network. These kind of considerations are crucial to achieve the goal of the booster: 
good system balance and energy-efficient scalability at system level. 

Note, however, that the MSA is much more general than the cluster-booster 
concept, which is very much focused on matching the different concurrency levels 
in applications. In the same way as the cluster provides hardware support to 
run the low/medium scalable part of codes while the booster does the same for 
highly-scalable codes, some applications need different acceleration technologies 
and varying sizes of memory devices and capacities. The MSA aims at fulfilling 
the needs of very diverse application requirements by interconnecting a variety 
of compute modules. Each module is a multi-node system of potentially large 
side, designed with specific hardware configurations that target a part of the 
application portfolio. 


High-scale 
Simulation 
workflow 


Module 6 
Multi-tier Storage 
System 


Deep. Data Analytics 
Learning workflow 
workflow 


Fig. 2. Distribution of three different (hypothetical) workflows on an MSA system. 
See in Fig. 5 the mapping on the DEEP-EST prototype of the neuroscientific workflow 
matter of this paper. 


One of the goals with MSA is to enable application developers to distribute 
their codes over a diversity of modules, such that each part of their workflows 
runs on the most suitable hardware (see Fig.2). A further goal is to facilitate 
the adoption of new computing technologies in HPC. Therefore - though not yet 
implemented in existing platforms — the concept includes the option of integrat- 
ing future technologies such as neuromorphic and quantum computing, providing 
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seamless integration into a traditional HPC environment in order to enable their 
use in scientific workflows. 


3 Current Hardware Platforms 


Several MSA platforms have been deployed at JSC. Here two systems are 
described, showing how the architecture itself evolves with time employing the 
newest available technologies. 


3.1 JURECA Cluster-Booster 


In November 2017, with the deployment of its booster module, the JURECA 
cluster-booster system became the first modular supercomputer worldwide to 
be listed in the Top 500 list, reaching position 29 with the Linpack benchmark 
running over both partitions [7]. Both modules can obviously be operated sepa- 
rately, but what makes JURECA unique is that complex applications can also 
run across both, using it as one MSA system. 

While the cluster uses multicore processors (Intel Xeon Haswell) and 
100 Gb/s Mellanox (EDR) InfiniBand, the booster uses multi-core processors 
(Intel Xeon Phi KNL) and 100 Gb/s Intel Omni-Path. Bridging the two dif- 
ferent high-speed network technologies is possible in JURECA through a cus- 
tomized development in the ParTec ParaStation Software Suite [8,9], which is 
continuously researched and optimized. 


3.2 DEEP-EST Prototype 


The DEEP-EST project has built an MSA-prototype with three compute mod- 
ules: cluster module (CM), extreme scale booster (ESB), and data analytics 
module (DAM) - see Fig.3. The main hardware characteristics of each module 
are detailed in Table 1. It is worth noting that, unlike in JURECA, the DEEP- 
EST booster is built using an GPU attached to an x86 CPU. As mentioned in 
Sect.2, the role of this host-CPU is reduced to an auxiliary function and it is 
not intended to employ it for application computing. 

The DAM module is intended to run the parts of applications dealing with large 
amounts of data. Therefore, the DAM is provided with very large memory capac- 
ity, combining both volatile and non-volatile technologies. Codes that can particu- 
larly profit from such capabilities are those performing data-analytics, or running 
machine learning or deep learning algorithms. The latter benefit also from accel- 
eration devices containing tensor cores and support for mixed precision. For this 
reason, the DAM contains both NVIDIA GPUs and Intel Stratix10 FPGA units. 
With its variety of components the DAM is the module offering maximal flexibil- 
ity. This comes at the prize of a higher energy consumption. However, since the 
DAM is only used for small-scale problems its node-number can be kept low. 

Additional to the three compute modules, the DEEP-EST prototype contains 
two storage modules: the all-flash storage module (AFSM) and the hard-disk 
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Fig. 3. Schema and picture of the DEEP-EST prototype at JSC (as of March 2021, 
fully installed). 


Table 1. Key hardware features of the DEEP-EST MSA prototype. 


DEEP-EST CM DAM ESB 

Time of deployment | 2019 2019 2020 

Node count 50 16 75 

CPU type Intel Xeon 6146 Intel Xeon 8260M Intel Xeon 4215 

CPU codename Skylake Cascade Lake Cascade Lake 

Cores @ frequency 12 @ 3.2 GHz 24 @ 2.4GHz 8 @ 2.5GHz 

Accelerators per node | n.a 1x Nvidia V100 GPU 1x Nvidia V100 GPU 
1x Intel Stratix10 FPGA 

DDRA4 capacity 192 GB 384 GB 48 GB 

HBM capacity na 32 GB (GPU) 32 GB (GPU) 

Node max. mem BW | 256 GB/s 900 Gb/s (GPU) 900 GB/s (GPU) 

NVM capacity n.a 2/3 TB n.a 

NVMe/SATA SSD 512 GB 3 TB 250 GB 

Power/node 500 W 1600 W 500 W 

Cooling warm-water Air Warm-water 

Network technology | EDR-IB (100 Gb/s) | EXTOLL (100 Gb/s) EDR-IB (100 Gb/s) 
Ethernet (40 Gb/s) EXTOLL (100 Gb/s) 

Topology Fat-tree Switched 2D-torus tree/grid 


based scalable storage module (SSSM), to enable fast I/O, run the file system 
and provide external storage capabilities. 

All the compute and storage modules have been already installed and are up- 
and-running at JSC. The DEEP-EST prototype continues in operation beyond 
the end of the EU-funding time-frame and runs in near-production environment. 
It is being used for further development of the software stack and programming 
model of MSA systems in the DEEP-SEA project [2], as well as to run applica- 
tion tests to evaluate the benefits of its architecture and the functionality of its 
software stack. 
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4 Software Environment 


The previous section showed how the MSA can be realized with very different 
hardware components. In fact, one could consider any heterogeneous computer as 
an MSA-system, as long as it can be operated in such a way that individual appli- 
cations can run over various kinds of nodes, and these can be scheduled and allo- 
cated according to diverse application needs. Therefore, one could argue that the 
MSA is more a software infrastructure enabling the dynamic operation of a het- 
erogeneous computer, rather than the hardware architecture of the system itself. 

The MSA software stack enables application developers to map the intrinsic 
scalability patterns of their applications and workflows onto the hardware: highly 
parallel code parts run on the large-scale, energy-efficient booster, while less 
scalable code parts can profit from the high single-thread performance of the 
cluster, or from the high memory capacity of the data-analytics module. Users 
can freely decide how many nodes to use in each module, so that the highest 
application efficiency and system usage can be achieved [10]. 


4.1 Scheduling 


The scheduling software used in the current MSA systems is SLURM [11]. Hard- 
ware heterogeneity is supported with SLURM's job-pack functionality, which 
provides semantics to express the amount of nodes to be reserved in each par- 
tition of an heterogeneous platform. The same annotation enables a user to run 
his/her workflow across nodes on different modules of an MSA system. However, 
in its current implementation SLURM statically reserves all nodes for the whole 
duration of the job, regardless of the fact that they are continuously used or 
not. For example, for a workflow performing pre-processing on the cluster and 
running then a long simulation on the booster, the nodes on the cluster will 
be kept reserved (and idle) until the simulation finishes in the booster. This is 
certainly not the wished behaviour on the MSA. 

Extensions to the SLURM scheduler are therefore being implemented, aiming 
at reserving and releasing nodes for a job-pack when necessary. The DEEP-EST 
implementation relies on a new ---delay switch, which can be used to inform 
the scheduler of the time-span between the start of one and the next job in a 
job-pack. Based on this information, the reservation of the second module can 
be started when it is actually needed, and not before. Further extensions to the 
scheduling and resource managing system for MSA are envisioned within the 
DEEP-SEA project. 


4.2 Programming Environment 


In order to facilitate portability, the MSA software stack aims at supporting 
the hardware complexity, while providing all the needed functionality and facing 
the application developers with the well-established interfaces and programming 
models that they know and use in other HPC systems. Therefore, the MSA 
programming paradigm is based on MPI. To support MSA-systems employing 
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different network technologies in different modules (such as JURECA) a gateway 
protocol has been developed [12]. For application developers, this protocol is fully 
transparent and hidden behind the MPI library. 

The simplest way of running an application on an MSA system is using only 
one module. Monolithic, highly scalable codes will actually likely run this way. 
On the other hand, codes that perform multi-physics or multi-scale simulations 
can run across compute modules and exchange data between them via MPI. 
This is the scenario displayed in Fig. 4, where an MPI application running on 
the cluster spawns part of its code to the booster. 


Inter-Communicator 


MPI_COMM_WORLD 
(B) 


MPI_COMM_WORLD 
(A) 


Cluster Booster 


Fig. 4. Scheme of an MPI application running on the cluster and spawning part of its 
code to the booster. 


MPI codes can be distributed over the MSA employing any of the collective 
instructions in the MPI standard allowing to connect two MPI_Comm_World() 
with each other. For instance, a subset of MPI tasks can collectively spawn a new 
MPI Comm World() to another module via the instruction MPI_Comm_Spawn(). Its 
inter-communicator connects the children to the parent processes and enables 
transferring data between them. Similarly, two MPI_Comm_World() running on 
different MSA modules can be connected with each other via the instruction 
MPI_Connect(). Furthermore, an MPI_Comm_World() can be split into two by 
using MPI_Split() and then send each one to a different module. 

Arguably, splitting an MPI programm across modules is the most difficult way 
of using an MSA system. Distributing workflows is much simpler since one does 
not need to take care of the MPI communicators. In general, workflows use differ- 
ent codes used to execute different actions after (or in parallel) to each other. For 
example, a user may need to pre-process data before running a long simulation, 
then perform data-reduction, and ultimately visualize the final result. Running 
these codes on different modules consists simply on indicating to the scheduler on 
which nodes to execute each step. In the SLURM language a workflow is named 
a job-pack (see Sect. 4.1), and a set of SLURM instructions enables running each 
step on a different partition of module of an heterogeneous system. 

Between the codes of a workflow data is currently transferred via the file- 
system, which means that it is written onto the external storage in one step, 
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and then re-read in the next workflow-step. Taking into account the time and 
power consumed in such write-read operations, this approach is not necessarily 
the fastest, and certainly not the most scalable. Because of that, the DEEP-EST 
project has investigated the potential implementation and benefits of directly 
transferring data between workflow steps via MPI. The data can reside directly 
at the node-memories or be stored in new kind of network-attached memory 
devices [13]. 


5 Neuroscience Workflow on MSA 


Computational neuroscience, in its attempt to better understand the human 
brain via simulations, uses both multi-scale applications and complex workflows, 
and should therefore profit from the MSA concept. To prove it, the DEEP-EST 
project has studied the use of MSA for a neuroscientific workflow in which NEST 
and Arbor (described in Sects.5.1 and 5.2, respectively) are the main compo- 
nents. A schema of the workflow distribution on the DEEP-EST prototype is 
given in Fig. 5. 


Neuroscientific 
workflow —. 


Fig. 5. Distribution of the neuroscientific workflow (NEST, Arbor) on the architecture 
of the DEEP-EST prototype. 


Brain function involves the interaction of neurons located in different brain 
areas. Therefore, spiking neural network models representing multiple brain 
areas are becoming more and more popular in Computational Neuroscience. The 
multi-area model [14] is an early brain-scale model at the resolution of single neu- 
rons that incorporates experimental data defining the connections between neu- 
rons, called synapses. It comprises approximately four million highly simplified 
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model neurons and on average 6000 incoming synapses per neuron. Recording 
all neuronal activity — called spikes — for the entire duration of a simulation 
could easily reach the terabyte data-volume range. Interpreting such data in a 
meaningful way is even more challenging, also because experimental techniques 
currently can only record spike activity of a small proportion of neurons in a 
brain area, limiting the experimental data available for comparison to simula- 
tion results. 

Neuroscientists therefore also record mesoscale signals such as local field 
potentials (LFP): A single micro-electrode or an array of micro-electrodes is 
inserted into the brain tissue in order to record the electrical activity, especially 
input currents, of all neurons in a volume roughly 1 mm in diameter. Due to the 
use of highly simplified point neurons in, e.g., the multi-area model [14], LFP 
signals cannot be obtained directly from simulations of that or similar models. 

The Python package LFPy! enables the calculation of local field potentials 
by driving simulations of uncoupled compartmental neuron models by the spike 
output of point neuron network models [16]. This multi-scale simulation thus 
allows for the comparison of LFP signals from brain-scale network models to 
experimental data. As the simulation of brain-scale point neuron networks and 
uncoupled compartmental neuron models create very different computational 
loads, the prediction of LFPs from brain-scale models presents an important 
neuroscience application for MSA systems such as the DEEP-EST prototype 
(see Sect. 3.2). 

With this target neuroscience application in mind we have investigated the 
limits to NEST-Arbor co-simulation on the DEEP-EST prototype. Figure 6 illus- 
trates the concept, while Sect. 5.1 and Sect. 5.2 describe the involved simulators 
NEST? and Arbor?, respectively. The overall runtime of the multi-scale simula- 
tion depends on the individual runtimes of the simulators and the latency of the 
frequent collective MPI communication between CM and ESB. 


5.1 NEST 


NEST is a simulator for spiking neural network models that focuses on the 
dynamics, size and structure of neural systems. In such networks neuron models 
are typically simple: they do not account for any neuronal morphology and the 
dynamics is governed by a small number of coupled differential equations, which 
in some cases can even be exactly integrated [17]. This enables the simulation 
of large-scale networks, where each compute node hosts many neurons and their 
incoming synapses. As in biologically realistic models of the cortex each neuron 
connects to a few thousand other neurons, an inherent bottleneck of the simula- 
tion of such networks is the frequent communication of neuronal signals (spikes) 
between compute nodes and the delivery of the spikes to their local targets. 
Large-scale neural network simulations with NEST make use of a hybrid par- 
allelization scheme combining MPI and OpenMP threads, where users typically 


1 lfpy.readthedocs.io; github.com/LFPy/LFPy. 
? nest-simulator.org; nest-simulator.readthedocs.io; github.com/nest /nest-simulator. 
3 arbor.readthedocs.io; github.com/arbor-sim/arbor. 
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Arbor/LFPy 


Cluster Module (CM) 


Fig. 6. Multi-scale simulation of a brain-scale network and concurrent calculation of 
LFPs as the target neuroscience application on the DEEP-EST prototype, requiring 
frequent transfer of neuronal activity data from the cluster module (CM) to the extreme 
scale booster (ESB) through collective MPI communication. Right: Simulation of the 
multi-area model [14] at single-neuron level resolution on the CM using NEST. Each 
area is represented by an adapted microcircuit model [15] with area- and layer-specific 
population sizes. Blue triangles and red dots in the magnified microcircuit-model illus- 
tration indicate two different types of neurons and their varying population sizes across 
layers. Connectivity between areas is based on experimental data and varies depending 
on source and target area as indicated by the connectivity matrix. Adapted from Fig. 1 
and Fig. 4D in [14]. Left: Simulation of one of the areas at sub-neuronal resolution using 
Arbor on the ESB, and continuous calculation of LFPs using LFPy. Morphologies of 
the multi-compartment neuron models are based on experimental data. Neurons are 
not connected as all spike input is obtained from the multi-area model simulation on 
CM. Adapted from Fig. 1 in [16] 


define the network models and steer the simulations through the Python based 
interface PyNEST [18]. A variety of neuron and synapse models are already 
included in NEST but it also offers the possibility to define custom models 
using the domain specific language NESTML [19]. NEST has an interface to 
the Multi-Simulator Coordinator (MUSIC) [20], which enables multi-scale sim- 
ulations. Besides, NEST’s refactored recording infrastructure [21] facilitates the 
definition of communication interfaces to other simulators such as Arbor. The 
NEST code is open source. All contributions to the code-base undergo review 
and are automatically tested by a continuous integration system running the 
NEST testsuite [22]. 

NEST is a simulator with versatile applications: from interactive explorations 
of small-scale networks on laptops to simulations of brain-scale networks on 
supercomputers. With the introduction of the 5g simulation kernel [23,24] the 
scalability of NEST has extended even further with respect to both runtime and 
memory usage, see Fig. 5.1. 
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Fig. 7. Weak scaling of the NEST HPC benchmark on JUQUEEN for the current and 
the previous simulation kernel (NEST 5g and 4g, respectively). CPU time and memory 
usage per compute node for a network simulation for 1s of biological real time, where 
each compute node hosts 18,000 neurons with 11, 250 incoming synapses per neuron; 
6496 of all synapses have dynamically changing weights. Simulations were performed 
using 1 MPI process per compute node and 8 threads per MPI process. Adapted from 
Fig. 7 in [23]. 


'The roadmap for the development of the simulation technology is defined 
by the NEST Initiative^. Current work comprises performance profiling and 
redesign of the algorithms underlying spike communication and spike delivery, 
and the development of more efficient ways of handling neuronal populations. 
'This enables faster construction and simulation of highly structured networks 
such as the multi-area model (Fig. 7). 


5.2 Arbor 


Arbor is a performance-portable library for simulation of large networks of 
morphologically-detailed neurons on modern high-performance computing sys- 
tems [25,26]. Arbor simulates networks of spiking neurons, particularly multi- 
compartment neurons. In these networks, the interaction between cells is con- 
veyed by spikes and gap junctions and the multi-compartment neurons are char- 
acterized by axonal delays, synaptic functions and cable trees. Each cell is mod- 
elled as a branching, one-dimensional electrical system with dynamics derived 
from the balance of transmembrane currents with axial currents that travel 
through the intracellular medium, and with ion channels and synapses repre- 
sented by additional current sources. Arbor additionally models leaky integrate- 
and-fire cells and proxy spike sources. 

The Arbor library is an active open source project, written in C++14 and 
CUDA using an open development model. It can scale from laptops to the largest 
HPC clusters using MPI. The on-node implementation is specialized for GPUs, 
vectorized multicore, and Intel KNL with a modular design for enabling exten- 
sibility to new computer architectures, and employs specific optimizations for 
these GPU and CPU implementations. The GPU is deployed for updating cur- 
rents and integrating gating variables using an optimized parallel GPU solver for 


^ nest-initiative.org. 
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sparse matrices with an optimized memory layout and reduced memory access. 
In detail, the GPU solver uses fine grained parallelization with one dendrite 
branch per thread, and a cell distribution into CUDA blocks to avoid global 
synchronization. In the simulation setup work balancing per thread avoids idle 
threads by sorting all submatrices on a level in a block by size. To maximize 
the utilization of the GPU memory bandwidth the memory layout is optimized 
by storing data in an interleaved format for each branch. Memory read access is 
reduced by storing only one parent compartment for each branch (Fig. 8). 


wall time (s) 


cells 


Fig. 8. Performance of Arbor. Left: Single node wall time of Arbor running on Piz Daint 
multicore, GPU and Tave KNL. Right: The single node speedup of Arbor running on 
Piz Daint multicore and GPU relative to NEURON on multicore. Adapted from Fig. 5 
in [14]. 


By implementing the design goals of scalability, extensibility and performance 
portability, Arbor is an order of magnitude more efficient than existing simula- 
tion engines [25]. Arbor does this without sacrificing ease of use and flexibility. 
Arbor’s single node speedup performance has been analyzed using a randomly 
connected network benchmark employing CSCS’ Piz Daint multicore, GPU and 
KNL clusters. For more than 4,000 cells the GPU is utilized enough to run the 
benchmark more efficiently in terms of the wall time than on multicore or KNL 
(Fig. 5.2, left panel). Compared to NEURON [27], the most widely used software 
for general simulation of networks of multi-compartment cells, Arbor is 5-10x 
faster for fewer than 128 cells, and for more than 256 cells it is over 20x faster 
(Fig. 5.2, right panel). 

Benchmarking and validation of Arbor and other simulators can be performed 
with the NSuite performance and validation testing suite? which is on-going 
work in Arbor development. Full support for the SONATA [28] model exchange 
format is under active development, as well as a Python API. Arbor will provide 
APIs for integration with other tools and simulators, including co-simulation 
with NEST. On a technical level a NEST-Arbor two-way coupled co-simulation 
imply some specific challenges, e.g., enabling injection of external spikes, as well 
as new initiation steps to align time delays and the number of external cells. 


? nsuite.readthedocs.io; github.com/arbor-sim/nsuite. 
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6 Summary 


The Modular Supercomputing Architecture (MSA) proposes a new philosophy 
for the integration and use of heterogeneous computing resources. Instead of 
regarding acceleration devices as intra-node entities and using them to speed-up 
very concrete kernels of the codes executed on general-purpose host-CPUs, the 
MSA strives for operating accelerators as first-class, autonomous compute ele- 
ments. The MSA segregates the heterogeneous resources into individual modules, 
each one being a multi-node platform of potentially large size tailored to specific 
kinds or parts of applications. Each module can be sized differently, according 
to the energy efficiency of the hardware and the needs of the users. At the same 
time, applications and workflows can be distributed over different modules using 
the overarching MSA software environment, enabling each step to be executed 
on the best suited hardware. 

The field of Computational Neuroscience is already preparing to employ the 
MSA approach, targeting the DEEP-EST prototype with a workflow that com- 
bines the codes NEST and Arbor. This multi-scale neuronal network simula- 
tion connects two types of neuronal simulations that are fundamentally different 
in computational load, memory access behaviour, and communication require- 
ments. 

A simulation of the multi-area model with NEST is not particularly com- 
putationally costly as it involves only the update of simple model neurons. In 
such large-scale network simulations it is rather the frequent and unpredictable 
exchange of neuronal signals that imposes stress on MPI communication and 
memory access, and thus dominates the total runtime. The cluster module is 
therefore best suited for this type of simulation. 

For the Arbor simulation of multi-compartment neurons the computational 
costs per neuron are much higher than for a large-scale point-neuron network 
simulated with NEST. At the same time, the communication of spikes is of 
minor importance for the overall runtime of the Arbor simulation because it is 
much smaller in terms of number of neurons. In the planned application, the 
compartmental neurons are not even connected and communication of neuronal 
signals within Arbor is thus not required. Therefore, the Arbor simulation is more 
compute bound and can benefit from the GPUs of the DEEP-EST booster. 

The installation of the DEEP-EST prototype has been completed with the 
deployment of its third and last module — the booster — in early 2020. NEST and 
Arbor have been adapted to run on the prototype, to show the benefits of execut- 
ing an important neuroscientific workflow across the modules of an MSA plat- 
form. Adaptations to NEST and Arbor include the development of correspond- 
ing interfaces for spike exchange between the simulators, using MPI laying the 
groundwork for the neuroscience workflow (Sect. 5). Co-simulation benchmarks 
show that spiking network simulations with NEST running on the cluster module 
can be extended such that spikes generated in NEST drive compartmental neu- 
rons simulated with Arbor on the booster module without runtime penalty [29]. 
Moreover, the simulation time of NEST has been significantly reduced by opti- 
mizing the spike-delivery algorithm hiding memory fetch latency [29], which 
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contributes to more efficient co-simulation. We consider the optimizations a gen- 
erally useful contribution to large-scale network simulation as they are applicable 
to other simulators for pulse-coupled networks with high connection degrees. 
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Abstract. The Human Brain Project (HBP) (https://humanbrainpro 
ject.eu/) is a large-scale flagship project funded by the European Com- 
mission with the goal of establishing a research infrastructure for brain 
science. This research infrastructure is currently being realised and will 
be called EBRAINS (https://ebrains.eu/). The wide ranging EBRAINS 
services for the brain research communities require diverse access, pro- 
cessing and storage capabilities. As a result, it will strongly rely on e- 
infrastructure services. The HBP led to the creation of Fenix (https:// 
fenix-ri.eu/), a collaboration of five European supercomputing cen- 
tres, who are providing a set of federated e-infrastructure services to 
EBRAINS. The Fenix architecture has been designed to uniquely address 
the need for a wide spectrum of services, from high performance com- 
puting (HPC) to on-demand cloud technologies to identity and access 
federation, for facilitating ease of access and usage of distributed e- 
infrastructure resources. In this article we describe the underlying con- 
cepts for an audience of computational science end-users and develop- 
ers of domain-specific applications, workflows and platforms services. To 
exemplify the use of Fenix, we will discuss selected use cases demonstrat- 
ing how brain researchers can use the offered infrastructure services and 
describe how access to these resources can be obtained. 


Keywords: Fenix - Human brain project * Distributed 
e-infrastructures 


1 Introduction 


Today, advances in addressing grand challenges depend on the ability of 
researchers coming from different geographic regions to effectively collaborate 
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and having flexible access to distributed e-infrastructure services. Fenix supports 
the former and provides the latter as it enables researchers, e.g., to collaborate on 
data curation, aggregation and sharing by providing federated storage sources 
as well as to use high-capability resources like High Performance Computing 
(HPC) systems. 

'The brain research community is a diverse community that applies a large 
variety of methods. It is thus not surprising that the requirements concerning 
e-infrastructure services are rather diverse. While some teams need massively- 
parallel HPC systems for large-scale simulations, others are producing extreme- 
scale data sets and employ advanced and potentially compute-intensive data 
analysis techniques. Given the size of the data sets, data analysis and making 
these data sets available to the wider community needs to be done in such a 
way that data transport can be avoided. This does not only help reducing costs 
or improving performance, in a number of cases this is mandatory given the 
involved amount of data. Yet other research teams perform computational and 
data analysis tasks requiring compute resources at smaller scale, but need to do 
this in an interactive manner. 

Fenix is creating an infrastructure layer comprising services that are federated 
in a rather lightweight fashion. It is designed such that quite different types 
of compute services ranging from HPC to Cloud as well as different types of 
data repositories can be integrated. This initiative is realised by five European 
supercomputing centres, namely BSC in Spain, CEA in France, CINECA in 
Italy, CSCS in Switzerland, and JSC in Germany. Fenix is organised in such a 
way that other resource providers may join in the future. 

'T his paper is organised as follows: In the next section we introduce the general 
concepts that led to the architecture of Fenix. In Sect. 3 we provide details on 
the Fenix services and discuss in Sect. 4 how EBRAINS services can make use 
of these. In Sect. 5 we describe how resources are allocated to users from HBP, 
before providing a summary and outlook in Sect. 6. 


2 Fenix Concept 


'The current architecture of Fenix is based on the general consideration that a 
clear separation between an infrastructure service layer and a platform service 
layer is beneficial. Such a layered approach is commonly used for creating Cloud 
infrastructures (see, e.g., [2]), where the terms Infrastructure-as-a-Service (IaaS) 
and Platform-as-a-Service (PaaS) are widely used. 

For Fenix we prefer to use the terms infrastructure service layer and platform 
service layer. The platform service layer encompasses all services that are specific 
for a given research domain. They are not necessarily useful for other domains or 
would require significant adaptations. A typical example are web-based portals, 
such as the EBRAINS Collaboratory!. While such portals are needed for almost 
any research infrastructure, their organisation is highly domain specific. The 


1 https:/ / wiki.ebrains.eu. 
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infrastructure service layer includes a set of services that allow implementing 
these platform services and are sufficiently generic for being useful for differ- 
ent research communities. One example are machines for deploying any of the 
aforementioned portal services, which are typically offered in a virtualised envi- 
ronment. Using Virtual Machine (VM) technology allows for better exploitation 
of the hardware resources as a larger number of VMs, which typically only need 
the resources of a few CPU cores, can be deployed within a single physical 
machine. 

'The infrastructure services are organised such that they can be provided 
by multiple, geographically distributed resource providers. While this approach 
adds the complexity related to the federation of these services, the approach has 
a number of important benefits, which we will discuss below. 

Most of the end-users are not expected to use the Fenix infrastructure ser- 
vices directly, but rather connect to the platform services that are deployed on 
top of these infrastructure services as shown in Fig. 1. For specialist users there 
is, however, the option to directly access the infrastructure services. Àn exam- 
ple are users performing simulations on massively-parallel HPC systems, who 
typically directly access these systems to compile and execute their simulation 
applications. 

'The layered approach has multiple benefits. In general, a layered approach 
and the resulting separation of concerns helps to manage complexity. From the 
perspective of the platform service providers, the abstraction of an infrastruc- 
ture service layer can help improving sustainability and performance due to the 
distributed nature of the infrastructure service layer, involving multiple infras- 
tructure resource providers. Resource providers can be replaced, for instance 
when funding conditions change, or the number of resource providers could 
be changed according to the needs of the platform service layer. Furthermore, 
platform service providers are enabled to improve on resilience by replicating 
their services over multiple sites. Another benefit of a distributed infrastruc- 
ture is improved data locality. With a larger number of infrastructure resource 
providers, the probability that storage resources are available in geographic prox- 
imity of the data source increases. From the perspective of the infrastructure 
service providers, the layered approach has the benefit that it allows the consol- 
idation of their service offerings when supporting multiple science communities. 
Finally, it creates opportunities for improving the utilisation of the offered hard- 
ware resources. 


3 Fenix Compute and Data Services 


In this section we provide an overview of the current service portfolio offered by 
Fenix, which was developed on the basis of an analysis of today's needs of the 
brain research communities. With other communities starting to use Fenix, the 
current portfolio of services is anticipated to change. 
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Fig. 1. Overview of the Fenix architecture as described in Sect. 2. Details on the infras- 
tructure services are provided in Sect. 3. 


The Scalable Compute Services (SCC) abstract large-scale computing 
resources. T'hese are HPC systems with a larger number of compute nodes with 
1-2 CPUs and possibly additional compute accelerators like GPUs. SCC services 
can be used for running highly parallel simulation applications, but are also suit- 
able for data analysis tasks, involving extreme-scale data sets. SCC resources are 
managed by a batch queuing system, which schedules jobs such that hardware 
utilisation is optimised. 

As an increasing demand for interactive access to compute resources is 
observed, Fenix introduces Interactive Compute Services (IAC) services. These 
allow end-users to obtain ad-hoc access to single compute nodes where interac- 
tive frameworks like Jupyter are offered. Typical usage scenarios are interactive 
analyses, visualisations, and steering of simulations running on SCC. 

VM services offer access to on-demand virtualised machines. The prime use 
case for this service is the deployment of platform services running in a “24/7” 
mode, for instance web-based portal services. 

To cope and comply with different and in parts incompatible needs and 
requirements, Fenix introduces two classes of data repositories. An Archival Data 
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Repository (ADR) is a storage system for long-term storage of data objects in a 
shareable manner. Such data repositories must therefore feature a standardised 
interface with easy to install clients and allow for federation. Mechanisms sup- 
porting flexible and fine-grained access control are another important feature. 
Fenix decided for the widely used Cloud object storage interface Swift?. 

Unlike an ARD, an Active Data Repository (ADR) is not a federated data 
repository that is relatively openly accessible from outside a data centre. An ACD 
is meant to be used for storing (copies of) private data sets and will typically 
offer 10 — 100x more bandwidth as well as significantly lower latency. Such 
features are important for data repositories connected to SCC services. A typical 
implementation of an ACD is based on a parallel file system with a POSIX 
interface like Lustre? or Spectrum Scale*. 

Both types of data repositories will be connected through a Data Mover 
service that will allow to asynchronously copy or move data back and forth. 

It is important to note that the different services have different security 
requirements. Some of them, like SCC, IAC and ACD, are realised in an HPC 
environment with tightly restricted access policies. VM services are deployed 
in a Cloud environment with an open connectivity. This will, e.g., allow users 
unknown to the data centre with weak or no credentials to connect to the plat- 
form services deployed on such resources. An ARD is connected to both environ- 
ments and thus can serve as a bridge between both worlds. Other connections 
between services deployed in different environments are subject to negotiations 
to identify the right balance between the realisation of advanced workflows and 
security concerns. 

Except for these restrictions, Fenix allows to combine the use of different ser- 
vices at different locations within a single project. This is a significant advantage 
compared to similar service offerings in Europe. Achieving this depends on the 
following prerequisites: 


— Allservices must be integrated into a single Authentication and Authorisation 
Infrastructure (AAT) such that a user can connect with the same credentials 
to any service offered at any Fenix site. 

— Resource management must be centralised such that both, a Fenix user as 
well as a Fenix resource provider, have an overview of the resources that are 
still available or have already been consumed. 

— For coherent management of access control, a central service is needed that 
makes the necessary attributes available. 


At the time of writing this article, a first version of the AAI is being put in 
operation, while the central resource management and attribute service, which 
is called FURMS, is still under development. 

While the Fenix service portfolio is provided using general-purpose hardware 
technologies, the concept described in this section also allows for provisioning 


? https:/ /wiki.openstack.org/wiki/Swift. 
3 https://lustre.org/. 
^ https://www.ibm.com/products/spectrum-scale. 
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of infrastructure services, using special-purpose hardware solutions. Within the 
HBP there are, e.g., ongoing efforts to integrate neuromorphic computing ser- 
vices that are provided on the BrainScales [8] and SpiNNaker [4] systems at the 
Universities of Heidelberg and Manchester, respectively. 


4 Selected EBRAINS Services 


In this section we introduce a selected set of EBRAINS services and discuss how 
these can make use of Fenix services. 

The EBRAINS Brain Simulation Platform comprises a suite of software tools 
and workflows for collaborative brain research that allow researchers to recon- 
struct and simulate detailed models of brain areas. This includes, e.g., simulators 
like NEST [5] and Neuron [7] as well as the neurophysiology data analysis tools 
package Elephant [3]. A simple workflow using Fenix services is shown in Fig. 2: 


1. The input model data is assumed to be stored in ARD #1 and is copied to 
an ACD from where it is accessible to the SCC service. 

2. The simulations are executed using the SCC service, which reads the input 
data from and writes the output data to the ACD. 

3. After completing the simulation, the final data products can be published by 
copying the data to ARD #2. 


HPC environment 


Fig. 2. Example for a brain simulation flow using Fenix services. 


Next, we consider a more complex example related to the EBRAINS Brain 
Atlases. EBRAINS aims to provide access to a new generation of 3-dimensional 
reference atlases of the human and rodent brain, which are defined at different 
scales and modalities. These atlases are based on histological data obtained from 
brain images (see, e.g., [1]). A complex workflow is required to first analyse and 
interpret the images as well as to integrate this data (see, e.g., [6]) and to later 
make the Brain Atlas as well as primary and secondary data products available 
to others. A possible realisation of such a workflow is shown in Fig. 3, which can 
be mapped to Fenix services as follows: 
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1. The primary data products are generated in a lab and stored in an ARD. 

2. SCC services allow processing of extreme-scale data sets as they occur in 
the case of images with a resolution of O(1 um), and facilitate the use of 
compute-intensive data analysis steps. To allow for fast access to the data, it 
will typically be staged from an ARD to an ACD. The resulting data products 
can be published after writing them into an ARD. 

. Multiple analysis steps using SCC services may follow. 

4. Final data products may be explored interactively using IAC services. 
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Fig. 3. Schematic view of a possible workflow for creating a Brain Atlas using Fenix 
services. 


5 Resource Allocation 


Part of the resources provided by Fenix are dedicated to HBP research and 
EBRAINS services as well as related research projects. A so-called programmatic 
access model allows to provide these resources to a research community (here the 
brain research community represented by HBP), with the latter being responsible 
for the allocation of the resources to projects proposed by researchers from that 
community. 

The HBP allocates the resources based on a peer-review mechanism that 
follows principles established by PRACE. Such a mechanism is widely used for 
making HPC resources available as it helps to ensure expensive resources being 
used for excellent science. The principles mandate, among other requirements, 
the peer-review process to be transparent and clear to all relevant stakeholders. 
Furthermore, the process must be fair such that all proposals are evaluated solely 
on merit and potential high impact on European and international science and 
economy. 

Applicants interested in using Fenix resources for brain research can, at any 
time, submit a proposal. After a technical review by Fenix resource providers, 


? https://prace-ri.eu/hpc-access/project-access/project-access-the-peer-review- proc 
ess/. 
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the EBRAINS Infrastructure Allocation Committee (LAC) is responsible for con- 
ducting or managing the scientific assessment in case of small- or large-scale 
resource requests, respectively. Based on the outcome of the review, the IAC can 
in the case of small-scale projects decide itself on whether to approve or reject a 
proposal. In the case of large-scale projects, the IAC prepares a decision-making 
proposal for the Directorate of the HBP. 


6 Summary and Outlook 


Fenix is an initiative that is realising a broad set of federated infrastructure 
services. The approach is based on a generic concept that aims for a separa- 
tion of infrastructure and platform services. While the former are generic and 
of use for a variety of research communities, the latter are research domain spe- 
cific. The approach allows research communities to establish distributed research 
infrastructures adapted to their needs. The brain research community is the key 
driver for Fenix and two examples of how this community can leverage services 
and resources from Fenix have been discussed. Similar efforts towards IT-based, 
distributed research infrastructures can, however, also be observed for other sci- 
ence communities. 

The HBP has established mechanisms for allocating resources offered by 
Fenix, which is open for researchers from HBP but also for brain researchers 
at large. Other scientists can also apply for Fenix resources through regular calls 
for proposals managed by PRACE. 
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Abstract. In recent years, Independent Component Analysis (ICA) has 
successfully been applied to remove noise and artifacts in images obtained 
from Three-dimensional Polarized Light Imaging (3D-PLI) at the meso- 
scale (i.e., 64jum). Here, we present an automatic denoising procedure 
for gray matter regions that allows to apply the ICA also to microscopic 
images, with reasonable computational effort. Apart from an automatic 
segmentation of gray matter regions, we applied the denoising procedure 
to several 3D-PLI images from a rat and a vervet monkey brain section. 


1 Introduction 


Studying the structure and function of the brain requires dedicated imaging 
techniques, allowing to map the highly complex nerve fiber architecture both 
with high resolution and over long distances. The neuroimaging technique Three- 
dimensional Polarized Light Imaging (38D-PLI) [1,2] was designed to reconstruct 
the three-dimensional orientations of nerve fibers in whole brain sections with 
micrometer resolution. 

To remove noise and artifacts in 3D-PLI images, Independent Component 
Analysis (ICA) has successfully been used [10-12]. However, the ICA has only 
been applied to mesoscopic images with a resolution of 64 jum pixel size and not 
to microscopic images with a resolution of 1.33 um pixel size so far. In order 
to resolve single nerve fibers, e.g. in the cerebral cortex, such a microscopic 
resolution is required. Light scattering, thermal effects, inhomogeneity of optical 
elements, or simply dust on the used filters are noise sources, which combined 
with the weak birefringence 3D-PLI signal in cortical areas inevitably lead to a 
low signal-to-noise ratio (SNR) and reconstruction errors. 

Identifying and removing these noise components in microscopic 3D-PLI 
images is very challenging. The amount of data that has to be processed is 
extremely large and the sampling has to be done differently as compared to 
mesoscopic images. When applying the developed ICA method on microscopic 


© The Author(s) 2021 
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images, the characteristic differences of the signal strengths in white matter and 
gray matter brain regions need to be taken into account. As the birefringence 3D- 
PLI signal of densly packed nerve fibers (i.e., fiber bundles) of the white matter 
proceeding within the sectioning plane are very strong and show a higher SNR 
than the less dense fiber tracts present in the gray matter, the denoising proce- 
dure needs only to be applied in regions of gray matter, which massively reduces 
the required computing time. 

Here, we present an automatic ICA denoising procedure for gray matter 
areas in microscopic 3D-PLI images. It consists of an automatic segmentation 
of gray matter, followed by a data-parallel ICA artifact removal with automatic 
classification of noise and signal activations. 


2 Methods 


2.1 Preparation of Brain Sections 


Brain sections from a Wistar rat (3 months old, male) and a vervet monkey (2.4 
years old, male) were selected for evaluation.! The brains were removed from the 
skull within 24h after death, fixed in a buffered solution of 4% formaldehyde for 
several weeks, cryo-protected with 296 DMSO and a solution of 2096 glycerin, 
deeply frozen, and cut along the coronal plane into sections of 60jum with a 
cryostat microtome (Polycut CM 3500, Leica, Microsystems, Germany). The 
resulting brain sections were mounted onto a glass slide each, embedded in a 
solution of 20% glycerin, cover-slipped, sealed with lacquer, and measured with 
3D-PLI up to one day afterwards. 


2.2 Three-Dimensional Polarized Light Imaging (3D-PLI) 


3D-PLI reconstructs the nerve fiber architecture of the brain with micrometer 
resolution. By transmitting linearly polarized light through unstained histologi- 
cal brain sections and analyzing the transmitted light with a circular analyzer, 
the birefringence of the brain section is measured, thus providing information 
about the three-dimensional orientations of the highly birefringent nerve fibers 
(myelinated axons) in the tissue [1,2]. The 3D-PLI measurements were performed 
with the same setup as described in [23] (LMP-1, Taorad GmbH, Germany), 
using incoherent green light with a wavelength of about 550nm. During the 


1 All animal procedures have been approved by the institutional animal welfare com- 
mittee at Forschungszentrum Jülich GmbH, Germany, and are in accordance with 
European Union guidelines for the use and care of laboratory animals. The vervet 
monkey brain was obtained when the animal was sacrificed to reduce the size of the 
colony, where it was maintained in accordance with the guidelines of the Directive 
2010/63/eu of the European Parliament and of the Council on the protection of ani- 
mals used for scientific purposes or the Wake Forest Institutional Animal Care and 
Use Committee IACUC #A11-219. Euthanasia procedures conformed to the AVMA 
Guidelines for the Euthanasia of Animals. 
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measurement, the direction of polarization of the incoming light was rotated by 
p = {0°,10°,...,170°} and the transmitted light behind the circular analyzer 
was recorded by a CCD camera (Qimaging Retiga 4000R) for each rotation 
angle, yielding a series of N — 18 images. The pixel size in object space was 
about 1.33 jum. For each image pixel, the measured intensity values form a sinu- 
soidal light intensity profile (PLI-signal, I(p)). The average value of the signal, 
i.e. the polarization-independent transmitted light intensity, is called transmit- 
tance and is a measure for tissue absorption and scattering (highly scattering 
tissue components such as nerve fibers appear dark in the transmittance image). 
The amplitude of the normalized signal is called retardation and indicates the 
strength of birefringence of the tissue. It is related to the out-of-plane angles 
of the nerve fibers in the brain section (in-plane nerve fibers show very high 
birefringence, while out-of-plane nerve fibers show much less [22]). The phase 
of the signal indicates the in-plane direction angle of the nerve fibers. Com- 
bining in-plane and out-of-plane angles, 3D-PLI allows to reconstruct the full 
three-dimensional orientations of the nerve fibers. 


2.3 Segmentation of White and Gray Matter 


Morphologically, brain tissue consists of two different tissue types: Gray matter 
and white matter. Gray matter contains various components, such as neuronal 
cell bodies, dendrites, synapses, glial cells, blood capillaries as well as myelinated 
and unmyelinated axons. Most of the gray matter regions are located at the outer 
surface of the brain (cortex), but also inner parts of the brain (i.e., sub-cortical 
nuclei) contain islands of gray matter. White matter is mainly composed of 
myelinated and unmyelinated axons. The largest portion of myelinated axons is 
located in the white matter. 

For the ICA method presented here, it is necessary to consider gray matter 
regions separately from white matter regions.? In the following, we present a 
fully automated procedure to generate masks of white and gray matter. 

As nerve fibers (myelinated axons) are highly birefringent, all regions with 
high birefringence signals (i.e. large retardation values r > rthres in the 3D-PLI 
measurement, cf. Fig.1(a) in orange) can be considered as white matter. (The 
determination of threshold values such as rires will be described below.) On the 
other hand, regions with low birefringence signals (r < rthres, Fig. 1(a) in blue) 
do not necessarily belong to gray matter, because regions with a small number 
of myelinated fibers, crossing nerve fibers, and nerve fibers that point out of the 
brain section (out-of-plane fibers) also yield low birefringence signals [22]. 

Studies by Menzel et al. [20] have shown that regions with crossing nerve 
fibers and regions with in-plane parallel nerve fibers yield similar transmittance 
values Jr, while regions with out-of-plane nerve fibers show lower transmittance 
values. Gray matter regions, on the other hand, show notably higher trans- 
mittance values. Hence, we can use the transmittance value in the region with 


? Note that we here define white matter as all regions (/image pixels) that contain 
myelinated nerve fibers. Anatomically, some of these regions might be known as gray 
matter because they only contain a small amount of myelinated nerve fibers. 
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maximum retardation Imax as a reference value (we expect that this region con- 
tains mostly in-plane parallel nerve fibers) and can then define all regions with 
similar or lower transmittance values as white matter (0 < Ip S Imax, contain- 
ing crossing and out-of-plane fibers). All other brain regions are considered to be 
gray matter. To separate brain tissue from background, we make use of the fact 
that the transmittance in the recorded images is expected to be much higher 
outside of the tissue than within the tissue (Ir > lower, cf. Fig. 1(b) in gray). 

'To enable an automated segmentation into white and gray matter regions, we 
consider the retardation and transmittance histograms (consisting of 128 bins, 
see Fig. 1 on top). Before computing the transmittance histogram, the values are 
normalized to [0, 1] and a median filter with circular kernel (radius of 10 pixels) 
is applied to the image to reduce noise. While the retardation histogram shows 
usually only a single peak at very low retardation values (caused by background 
and gray matter), the transmittance histogram shows one peak for low transmit- 
tance values (white matter), another peak for larger transmittance values (gray 
matter), and a third peak for high transmittance values (background). 

To compute the threshold value ries (Iupper), we determine the point of 
maximum curvature behind (before) the biggest peak in the retardation (trans- 
mittance) histogram, i.e. the position for which the angle difference between two 
neighboring data points becomes minimal. To ensure that the point of maximum 
curvature belongs to the onset of the biggest peak (and not to some other peak 
or outlier), we take the full width at half maximum (FWHM) of the peak into 
account and only search within 10x FWHM behind the retardation peak and 
2.5x FWHM before the transmittance peak, taking the different forms of the 
histograms into account. 

To compute Imax, the region with the maximum retardation value is deter- 
mined. To ensure that the region belongs to a white matter region and is not an 
outlier caused by noise, we use the Connected Components algorithm from the 
OpenCV library [4] (block-based binary algorithm using binary decision trees 
[6]) with eightfold connectivity. We mark all pixels with maximum retardation 
value and count the number of pixels in the largest connected region. If the num- 
ber is at least 512, we select this region as reference. If the number is lower, we 
reduce the maximum retardation value iteratively by 0.01, until we find such a 
region. In this reference region, we compute the average value in the normalized 
transmittance image (pmax, see red vertical line in Fig. 1(b)). This value can be 
used as first estimate to separate white from gray matter. To define the border 
more precisely, we use the point of maximum curvature between [pmax and Tupper 
as new threshold value Nower. 

Taking all this into account, we can compute the masks for white and gray 
matter as follows: 


White Matter: (0 < Ip < Dower) V (r > Tthres); (1) 
Gray Matter: (lower < Ir € Tupper) A (r € Tthres)- (2) 


All image pixels that fulfill Eq. 1 (Eq. 2) are considered as white (gray) matter, 
see Fig. 1(b) in white. All other image pixels are considered as background. 
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Fig. 1. Mask generation for white and gray matter, shown exemplary for a coronal 
vervet monkey brain section. (a,b) Top: Histograms of retardation and transmittance 
images obtained from a 3D-PLI measurement (128 bins in [0,1]); the determined thresh- 
old values are marked by vertical dashed lines. Bottom: Image pixels with values belong- 
ing to the orange, blue, and gray shaded regions in the histograms are marked in the 
respective colors. (c) Masks for white and gray matter, computed using the threshold 
values defined in the histograms on top. (Color figure online) 


2.4 Independent Component Analysis (ICA) 


The ICA belongs to the group of Blind Source Separation (BSS) techniques and 
can be used for data decomposition to find statistically independent components 
in a mixture of signals [17]. ICA has been applied to various artifact removal 
tasks, e.g. ocular artifact removal in electroencephalography [13], cardiac arti- 
fact removal in magnetoencephalography [5], and noise-signal-discrimination in 
functional magnetic resonance imaging [21]. 
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In 3D-PLI, a data set consists of a series of N images (one for each rotation 
angle). To avoid that the background interferes with the decomposition, each 
image is divided into background and our region of interest (ROI) with the latter 
containing M pixels. The measurements are flattened and centered to obtain a 
zero-mean data array X with the dimension N x M. The decomposition into 
sources S with the shape N x M requires that the data can be represented as 
a linear mixture of independent signals without additional additive noise, that 
there exist sufficient samples for every extracted feature (general advise is to 
keep k- N? > M with k € {1,2,3,...}), and that the distribution of the sources 
is non-Gaussian. With these prerequisites, the problem can be stated as 


X — AS, (3) 


where A is the so-called mixing matrix with the dimension N x N and is 
yet unknown. Because S and A are both unknown, it is impossible to make 
a prediction about sign or amplitude of the basis vectors of A. Furthermore, 
we have no knowledge about the number of components in our data set, so we 
assume that the complexity of the data can be mapped by N features. 

Prior to performing the ICA, the data array X is whitened by making use of 
a Principle Component Analysis (PCA) [24] to lower the degrees of freedom to 
N(N —1)/2 [17]. The ICA then estimates W ~ A^! by maximizing the entropy 
as in Infomax-based ICA [3] or by maximizing a measure of non-Gaussianity as 
in FastICA [15,16]. We then obtain 


WX =CHS, (4) 


with the component vector C. We find the activation profiles of the Compo- 
nents in W~! as basis vectors. It was shown that 3D-PLI signals contain sub- as 
well as super-Gaussian independent components, therefore FastICA or Extended 
Infomax [19], an extension of the Infomax algorithm, can be used. This work uses 
the Extended-Infomax Implementation of the MNE-Toolbox [14]. 


2.5 Automatic Noise Removal with ICA 


The activation profiles given by the ICA can be distinguished into two categories: 
noise activation profile and signal activation profile. Because we know the PLI- 
signal shape from theory, we know the shape of the basis vectors we are looking 
for in our mixing matrix W71 ~ A. A simple classification problem is visualized 
in Fig. 2. The sinusoidal shaped activations are the ones to keep and we want to 
drop the activations that resemble random distributions. 

The automatic identification is realized by fitting the expected (theoreti- 
cal) function to each of the N activations. As identification measure, the mean 
squared error (MSE) is calculated for every fit and compared to the mean of all 
MSE values. When the MSE of the i-th fit is smaller than 1/10 of the mean of 
all MSE values, we assume that the activation belongs to a signal component. 
Otherwise, we assume that the activation belongs to a noise component. 
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Fig. 2. Archetypal signal discrimination for ICA artifact removal: (a) sinusoidal PLI- 
signal activation, (b) random PLI-noise activation from e.g. thermal noise or light 
scattering. 


After the detection of all noise activations, we construct a denoised mixing 


matrix Ww; & Aq by just zeroing out the respective column: 
i ae T 
— Signal Activation 1 — — Signal Activation 1 — T 
— Signal Activation 2 — — Signal Activation 2 — 
— Signal Activation J — — Signal Activation J — 


—1.. EN 
Wa = | — Noise Activation 1 —| ^ | — 0 — 
— Noise Activation 2 — — 0 — 
— Noise Aetivation K — = 0 = 
(5) 


The denoised data array X4 can then be obtained by remixing the previously 
unmixed components: 


Wy'C=W,'WX = Xa. (6) 
Estimation of Signal Enhancement. The noise reduction of the artifact 


removal is measured with a weighted chi-square statistic introduced by [10]. 
It is based on the reduced chi-squared statistic defined by 


A. (Ipi) — F (pi)? 
EDS aar (7) 


with v for the degrees of freedom, N the number samples (here measure- 
ment angles), /(p;) the measured light intensity for an angle, f(p;) the expected 
function, and o(z, y, p) the standard error for every image point and angle. This 
statistic is applied to both noisy and denoised data leading to x2,,, and x?54. We 
denote the quotient of these two measures by relative Goodness of Fit (rGoF). 
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In case one component is missing and the denoised signal is inherently different, 
an additional weighting factor w defined by 


1 V o) - F (oi) 
wat i) 
i=1 
is included in the denominator, which penalizes large deviations between the 
expected function of the noisy signal f(p;) and the expected function of the 


denoised signal f*(p;). The so obtained measure is denoted by weighted relative 
Goodness of Fit (wrGoF): 


rGoF _ xe (9) 
d Xica 'w' 
where wrGoF > 1 is associated with a signal improvement while wrGoF « 1 
is associated with a signal degradation. 


wrGoF — 


Parallelization Concept. The parallelization is implemented by distributing 
the workload of the ICA problem equally in N parts to N workers via the 
Message Passing Interface (MPI) with mpi4py [7-9]. Every n-th element is sent 
to the n-th worker. After converging the result of all workers is collected and 
fused to the end result. 


3 Results 


The denoising procedure was applied to 22 brain sections (14 coronal rat brain 
sections and 9 coronal vervet brain sections). Every section was masked with 
a gray matter mask (as described in Subsect.2.3) to remove background and 
white matter areas. In all cases, three components of interest were found, each 
with a sinusoidal activation function. The signal activations were kept and the 
noise activations were automatically removed. The resulting components and 
activations are shown exemplary for rat brain section no. 100 in Fig. 3. 

The amount of wrGoF values are in all sections greater than one (799.990) 
and mostly greater than ten (29996). A spatial distribution of the values is 
shown in Fig. 4 for the rat brain section (right) and a vervet brain section (left). 
In Fig.5, three selected intensity profiles are shown for the rat brain section. 
Each individual profile shows an improvement and the denoised profile describes 
the measurement in a smooth way and is not influenced by outliers. 
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Fig. 3. Independent Components Ci; for rat brain section no. 100 with their associ- 
ated activations A;. The noise components are in the first and second row, the signal 
components are in the last row. 
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Fig. 4. Signal enhancement in a brain section visualized via the wrGoF measure. Left: 
wrGoF-map of vervet brain section no. 627. Right: wrGoF-map of rat brain section 
no. 100. 


'The alternating parallelization approach achieves a linear speedup of up to 12 
cores (i.e. half of the cores on a node on the JURECA supercomputer [18]). The 
usage of all cores on the node only improves the speedup by two additional units 
as seen in Fig. 6. Overall, a weak scaling can be observed. While four nodes give 
a speedup of factor 2 with respect to one node, six nodes only offer a speedup of 
2.5. For a complete vervet brain section (sample size ~ 10° pixels), the run time 
of the denoising routine for a single worker is about five hours. Using a whole 
node lowers this to half an hour, and using four nodes lowers the run time to 
15 min. 

This scaling behavior is the same for the rat and the vervet brain sections. 
Furthermore, the number of workers and therefore the number of partial ICA 
problems do not interfere with the quality of the denoising process. The percent- 
age of wrGoF values greater than one (799.996) or greater than ten (79996) are 
not influenced by the amount of parallelism. 
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Fig. 5. Signal enhancement by the ICA denoising procedure shown for rat brain section 
no. 100: The image on the upper left shows the transmittance of the brain section for 
gray matter (background and white matter are displayed in black). The graphs show the 
intensity profiles of three selected areas (colored dots in transmittance image) before 
and after the denoising procedure. 


244 - . 4| 
= = Ideal Scaling 4 64. == Ideal Scaling L 
22] m Ratruns "d v Vervet runs PF 
207 vw Vvervetruns 4 4 
18 4 77 5 7 ri 
16 * Pd 
4 g 4 
2 144 ^ 34] 7 
$ 12 | P " A " d 
w 104 * Y WY 34 P2 
84 ^ a å 
Fd dt A T 
64 7 24 4 v 
^17 i Fi Y 
J v Y 
4 P 1+7 
2 6 10 14 18 22 1 2 3 4 5 6 
# Cores # Number of Nodes á 24 cores 


Fig. 6. Scaling behavior of the ICA denoising procedure. Left: Intra-node scaling 
behavior for 1-24 Cores. Right: Global scaling behavior. 
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4 Discussion 


In this work, an automatic denoising procedure for 3D-PLI data based on Inde- 
pendent Component Analysis (ICA) for high-resolution PM data (with 1.33 jum 
pixel size) was presented. Previous works studied the denoising of low-resolution 
LAP data (with 64um pixel size) [10-12], but the application on PM data 
was limited due to computational and memory constraints and was not fully 
automatic. Furthermore, the existing solutions were not suitable for a high- 
throughput workflow because masks for tissue had to be manually created or 
adjusted. 

To overcome these limitations, three key steps had to been taken: The first 
step was to develop an automatic segmentation of brain tissue into white and 
gray matter, so that the ICA can work targeted on the noisy gray matter. The 
second step was an automatic detection of signal components in the ICA acti- 
vations. The investigated brain sections showed good separability by a simple 
MSE measure. The zeroing of noisy components was straightforward to imple- 
ment. Due to the fast convergence and easy separability, there was no need 
for constraints which would only complicate the procedure and add expensive 
hyperparameter training as in [10]. The third step was to parallelize the ICA in 
a pleasingly parallel manner to evenly distribute the workload and ensure that 
each worker receive similar statistics. This showed a weak, but significant scaling 
as shown in Fig. 6. 

'The obtained results for the wrGoF measure were consistently better than the 
ICA denoising for LAP data presented in [10,11]. The values were not influenced 
by the amount of parallelism. The amount of wrGoF values are in all brain 
sections greater than one (799.996) and mostly greater than ten (79996). Overall, 
the results are very promising for high-throughput denoising of high-resolution 
3D-PLI sections. 
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Abstract. The study of the visual system of the brain has attracted 
the attention and interest of many neuro-scientists, that derived compu- 
tational models of some types of neuron that compose it. These findings 
inspired researchers in image processing and computer vision to deploy 
such models to solve problems of visual data processing. 

In this paper, we review approaches for image processing and com- 
puter vision, the design of which is based on neuro-scientific findings 
about the functions of some neurons in the visual cortex. Furthermore, 
we analyze the connection between the hierarchical organization of the 
visual system of the brain and the structure of Convolutional Networks 
(ConvNets). We pay particular attention to the mechanisms of inhibition 
of the responses of some neurons, which provide the visual system with 
improved stability to changing input stimuli, and discuss their imple- 
mentation in image processing operators and in ConvNets. 


Keywords: Brain-inspired computing - Image processing - Inhibition 


1 Introduction 


'The development of the visual system of humans takes a number of phases, which 
include tuning the synaptic connections between neurons in the different areas 
devoted to the processing of different visual stimuli. In newborns, for instance, 
many connections between the Lateral Geniculate Nucleus (LGN), which is the 
first part of the brain devoted to visual processing, and the area V1 of the visual 
cortex are not formed yet. Similarly, the connections between neurons in the 
area V1 and subsequent areas start developing after the first month of life. 
'The tuning process of the receptive fields of the neurons of the visual sys- 
tem and the development of their inter-connected network can be compared to 
the training process of Artificial Neural Networks (ANNs). Since the beginning 
of their development, indeed, the design of ANNs has been largely inspired by 
the way the brain works, i.e. processing information via a network of neurons 
organized in a hierarchical fashion. Despite the resemblance of the Rosenblatt's 
perceptron with the physiological structure of a neuron, there is no actual rela- 
tion between the processing of ANNs and the neural processes in the brain. 


© The Author(s) 2021 
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Many researchers in computer vision and image processing found inspira- 
tions from neuro-physiological studies of the visual system of the brain to design 
novel computational models that could process visual data. In 1959, Hubel and 
Wiesel carried out experiments on the visual cortex of cats and demonstrated 
the existence of the simple cells, which are neurons with an elongated receptive 
field. T'heir primary function is to detect edges and lines. Originally, the simple 
cells were modeled using Gabor functions [11,24] and used in image processing 
and computer vision applications, especially for texture description and analy- 
sis [16]. Subsequently, Hubel and Wiesel precised that simple cells receive inputs 
from certain co-linear configurations of the circular receptive field of neurons in 
the LGN [20]. Computational models based on Gabor functions were not able 
to describe all the properties of simple cells and ignored the contribution of 
LGN neurons for the processing of visual stimulti. In [4], a computational model 
based on the combination of the responses of Difference-of-Gaussians functions, 
which modeled the LGN receptive fields, was proposed. It achieved better con- 
tour detection performance than models based on Gabor functions and showed 
more properties of the simple cells in area V1 of the visual system of the brain, 
such as contrast invariant orientation tuning and cross orientation suppression. 

Artificial neural networks (ANNs) and, in particular, convolutional neural 
networks (ConvNets) received much attention and showed some similarities with 
the visual system of the brain especially regarding its hierarchical organization. 
Although the training of neural network is formulated as an optimization prob- 
lem and does not relate with biological processes, in [25] it was shown that the 
convolutional kernels learned in the first layer of AlexNet resembled the Gabor 
functions that were used to model the receptive field of neurons in the area V1 
of the visual system. Similarly, unsupervised approaches for image analysis like 
Independet Component Analysis also learned features for image processing that 
resemble the Gabor-like receptive fields of neurons in area V1 [18]. 

Neuro-scientific and neuro-physiological studies of the mechanisms and sys- 
tems that our brains uses to process external inputs have influenced also the 
developement of other branches of pattern recognition and artificial intelligence, 
such as sound signal processing. Patterson et al., in 1986, modeled the response 
of the cochlea membrane in the inner auditory system as a bank of Gamma- 
tone filters [33]. They called Gammatonegram the result of the processing of 
an input signal by a Gammatone filter bank. Similarly to the spectogram, the 
Gammatonegram is a time-frequency representation of the sound in which the 
energy distribution over time and specific bandwidths is described. Parts of 
higher energy intensity correspond to regions of the cochlea membrane that 
vibrates more according to the energy of the mechanical sound pressure waves 
that hit the outer part of the auditory system. This model was exploited in 
[45-47] as input to a trainable feature extractor, the design of which was inspired 
by the activation of the inner hair cells, placed behind the cochlea, which convert 
the vibration into electrical stimuli on the auditory nerve. 

This paper focuses on the relation between neuro-scientific studies and 
progress in Computer Vision and Image Processing, providing an overview of 
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methods and aspects that concern detection and processing of low-level features 
in images until more complex computations in convolutional networks. 


2 Brain-Inspired Processing of Visual Data 


One of the pioneering architectures for image processing and computer vision 
inspired by knowledge of the brain processes of vision was the neocognitron 
network [13]. It modeled the hierarchical arrangement of the visual system of the 
brain by layers of S- and C-cell components, which are computational models of 
the simple and complex cells discovered by Hubel and Wiesel [20]. The weights 
of the neocognitron network were learned via an unsupervised training process, 
based on self-organizing maps. This training resulted in a hierarchy of S- and 
C-cell units that resembled the organization of the human visual system. 

In the following of the section, some of these approaches are discussed, and 
part of the focus is given to the phenomena of inhibition that contribute to 
increase the selectivity of neurons to specific visual stimuli and how they are 
embedded in operators for processing of visual data. 


2.1 Edge and Line Detection 


Simple cells in area V1 of the visual cortex receive inputs from LGN cells in the 
thalamus of the brain and have the function of detecting elongated structures 
that contain high contrast information. The receptive fields of LGN cells are 
modeled by on- and off-center Difference-of-Gaussians (DoG) functions, while 
those of simple cells are modeled as co-linear arrangement of DoG functions. 
Originally, simple cells were modeled with Gabor functions, bypassing the con- 
tribution of the LGN cells. Computational models based on Gabor filters were 
used for contour and line detection and included in hierarchical architectures for 
object detection [36] and face recognition [34] tasks. 

Although Gabor filters were used, initially, to model the simple cell receptive 
fields [24], they did not reproduce certain properties, such as contrast invari- 
ant orientation tuning and cross orientation suppression. These properties were 
achieved by a non-linear model, named CORF (Combination of Receptive Fields) 
for contour detection [4]. It is based on the combination of co-linearly aligned 
DoG functions, modeling the way simple cells combine the response of LGN 
cells. A mechanism for tolerance to curvature of lines and contours, based on 
a non-linear blurring, was proposed in the CORF model to improve the results 
when deployed in image processing pipelines. 

An implementation of CORF, named (B-)COSFIRE (Combination of Shifted 
Filter Responses), where B- stands for bar-selective, was demonstrated to be 
successful for the detection of thick lines in images and applied to blood vessel 
delineation in retinal images (see Fig.1) [7,41], road and river segmentation 
in aerial images [44], crack detection in pavement images [38]. An example of 
the response map computed by a B-COSFIRE filter and its thresholded binary 
map are shown in Fig. 1b and Fig. 1c, respectively. A curved receptive field was 
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configured in [35], to detect high curvature points of the retinal vessel tree. In 
[40,42], the authors demonstrated that a bank of B-COSFIRE filters, configured 
to delineate lines of different thickness, can be used as feature extractors and 
combined with a classifier to perform complex decisions. 


(a) 


Fig. 1. (a) Example retinal image, the (b) response of the B-COSFIRE filter and (c) 
the corresponding binary map. 


2.2 Object(-part) Detection 


The response of neurons in area V1 are forwarded for further processing to 
neurons in areas V2 an V4 of the visual cortex, which are tuned to respond to sets 
of curved segments or vertices of some preferred orientation and badnwidth [32]. 
'These properties can be interpreted as functions for detection of parts of objects. 

Based on the principle of combining the responses of line and edge detectors 
at different orientations and with a certain spatial arrangement, an implemen- 
tation of the COSFIRE model that takes as input a bank of Gabor filters of 
different orientation was released [3]. In this case, the receptive fields of neurons 
in area V1 that give input to those in area V4 were modeled by means of Gabor 
functions. However, a hierarchical structure of COSFIRE models can be real- 
ized for more complex tasks like object recognition or scene understanding [5]. 
The COSFIRE model of neurons in area V4 can be trained to detect parts of 
object and used in applications of object recognition. In Fig.2, we show some 
examples of the parts of objects on which V4-COSFIRE models are trained. 
The light-blue ellipses indicate the location and the orientation at which the 
V1-like neuron responses are considered and their combination models a part of 
the object of interest. The configured models can be used to recognize parts of 
objects in other images or together in a filter-bank to extract feature vectors to 
be used in combination with a classifier. 


2.8 Inhibition for Image Processing 


One important aspect of the visual processes that happens in the visual system is 
the mechanism of inhibition. The receptive field of a simple cell, known as ‘clas- 
sical receptive field’ [19], is composed of an excitatory and an inhibitory region. 
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Fig. 2. The configured COSFIRE filters are represented by the set of light blue ellipses 
in the top row, whose orientation indicates the preferred orientation of the Gabor filter. 
In the bottom row, the part of the object that the corresponding COSFIRE filter is 
able to detect (figure from [3]). (Color figure online) 


Many simple cells are know to receive push-pull (or antiphase) inhibition [21]. 
This form of inhibition happens when visual stimuli of given orientation and 
opposite polarity evoke responses of opposite sign [10,12,31]. Furthermore, it 
is known to be the most diffuse form of inhibition in the visual cortex [1]. In 
practice, for a stimulus of given polarity the response of the inhibitory receptive 
field suppresses the response of the excitatory receptive field. 

'This phenomenon was implemented in the CORF operator and it was demon- 
strated to be beneficial for improving contour detection in presence of texture [6]. 
More recently, the effect of the push-pull inhibition was shown to increase the 
robustness of line detection to various types of noise and textured background: 
a novel RUSTICO (Robust Inhibition-augmented curvilinear operator) opera- 
tor was proposed in [37,39]. It was shown to be very effective for line detection 
in presence of noise and texture. RUSTICO is designed as an extension of the 
B-COSFIRE filter for line detection, by including an inhibitory component. In 
Fig. 3a and Fig. 3b, an aerial image of a river and the corresponding ground-truth 
are shown. The binary response map produced by RUSTICO (Fig. 3d) shows a 
more complete reconstruction of the line pattern of interest, i.e. the river, than 
that in the binary map produced by B-COSFIRE (Fig. 3c). 


(b) 


Fig. 3. (a) Aerial image of a river and (b) the ground truth of the river area. The (c) 
binary response map obtained by the B-COSFIRE filter is more noisy and contains 
less of the detected river patterns than the (d) binary response map of RUSTICO. 


110 N. Strisciuglio and N. Petkov 


Another phenomenon of inhibition found in the visual cortex is the surround 
suppression. It consists of neurons, whose response is suppressed by that of 
neighbor neurons in the surrounding of their receptive field [9,49]. The cells 
that exhibit this type of inhibition have a non-classical receptive field (NCRF). 
Practically, this means that the response to a certain stimulus can be influenced 
by the presence of similar stimuli in the surrounding of the receptive field. This 
mechanism of surround suppression was included in image processing operators 
to extend the Canny edge detector [14], a Gabor filter based contour detector [15] 
and in an operator with a butterfly-shaped receptive field [50]. 

More recently, the push-pull inhibition and surround suppression were com- 
bined in a single operator for contour detection, which outperformed its coun- 
terpart operators with single or none inhibition mechanism [30]. 


3 Convolutional Networks for Visual Data Processing 


Convolutional Neural Networks (ConvNets) became the de facto standard for 
image processing and computer vision, because of their effectiveness in deal- 
ing with various visual recognition tasks. Successful applications of ConvNets 
are image and object recognition [17], semantic segmentation [8], place recogni- 
tion [2, 27], image generation and image-to-image translation [22], among others. 

ConvNets are based on convolution operations and exploit the characteristic 
of locality of the patterns of interest. This means that the value at a certain pixel 
location of a response map is detemined by the linear combination of the values 
of a small neighborhood of the corresponding pixel in the input image. From this 
perspective, ConvNets can be considered as a regularized version of multi-layer 
perceptron (MLP) networks. The fully-connectedness means that each neuron at 
a certain layer receives input from all the neurons in the previous layer. In a Con- 
vNet, instead, each neuron (i.e. a convolution kernel) has a very limited number 
of inputs, and it slides over the input signal to compute its response. Although a 
single convolution catches local proprieties of the input signal in small-size neigh- 
boroods, the hierarchical organization of ConvNets allows to assemble more and 
more complex patterns in subsequent steps. 

The hierarchical organization of ConvNets, which arranges a stack of con- 
volutional layers, non-linear activation functions and sub-sampling operations 
resembles the hierarchy of the visual system of the brain. Speculations of this 
type were reinforced by the results obtained by the AlexNet network [25]. On top 
of the improvement of the classification accuracy by a large margin with respect 
to previous approaches, it was shown that the filters learned in the first layer 
of AlexNet resembled Gabor-like receptive fields (see Fig. 4), which are accepted 
computational models of neurons in the area V1 of the visual system of the 
brain [29]. Hence, in the first layer of AlexNet edge and elongated structures of 
different bandwidth are detected. The interpretations consist in that in subse- 
quent layers, the detected edge and line patterns are combined into corner-like 
structures, similarly to the area V2 and V4 of the visual cortex, and into parts 
of objects (anterior and posterior TEO). 
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Fig. 4. Visualization of the convolutional kernels learned in the first layer of AlexNet. 


The convolutions used in ConvNet architectures are linear operators and 
are not able to fully model some non-linear properties of the neurons in the 
visual cortex, e.g. response saturation or cross-orientation suppression. In [51], 
quadradic convolutions, in the form of Volterra kernels, were investigated and 
deployed as substitute of the convolution operations in existing architectures. 
This type of convolutions is more suited for a better approximation of the profile 
of the receptive fields of some neurons in the visual system. The approach was 
extended in [23], in which quadratic convolutional kernels contributed to reduce 
the depth, i.e. the total number of convolutional layers, of existing architectures 
while keeping the detection and classification performance of the corresponding 
deeper original networks. 

On the one hand, the use of quadratic convolutions is justified by the closer 
connection with the function of the receptive field of the complex cells in the 
visual system, and contributed to a relatively small increase of performance. On 
the other hand, they require a much larger number of parameters to be learned, 
slowing down the training and increasing the complexity of the functions to 
be learned. In [51], indeed, due to computational limits, only the first layer of 
convolutions was replaced by Volterra kernels. 

Another type of non-linear unit was proposed in [28], which incorporate the 
framework of the COSFIRE model of the neurons in the area V4 of the visual 
system into a new type of layer for ConvNets. The response of this layer is 
computed by combining the response maps of local simpler features according 
to a spatial structure that is determined in an automatic configuration step. 
During the training of the network, the CNN-COSFIRE layer can be configured 
to detect a certain arrangement of local features, so allowing for a larger receptive 
field that can catch non-local characteristics of the patterns of interest, such as 
parts of or entire objects. It was successfully demonstrated in applications of 
object detection and place recognition where few training samples are available. 


3.1 Inhibition in Convolutional Networks 


ConvNets learn representations, disentangling complex features of the training 
data. Inhibition is believed to be a mechanism for regularization and stability of 
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the processes that happens in the visual system [26], and forms of inhibition are 
learned in ConvNets as well [48]. 

AlexNet deployed a layer called Local Response Normalizer (LRN), which 
implemented a surround suppression mechanism called lateral inhibition. This 
type of inhibition creates a form of competition among neurons in a local neigh- 
boround. The LRN builds on the idea of enhancing peak responses and penalizing 
flat ones on the feature map, making relevant features stand out more clearly. 
Thus, in the implementation, high local responses of one convolutional kernel 
inhibit weaker responses of other convolutional kernels in the same local neigh- 
bourhood. This serves as a form of regularization of the network and improves 
recognition performance. 

In [43], a new type of layer that implements the push-pull inhibition mech- 
anism was proposed, which can be used as a substitute of the convolutional 
layer. The push-pull layer can be trained with back-propagation of the gradient 
of the error and is interchangeable with any convolutional layer in the network. 
However, as it is inspired by neuroscientific evidence of inhibition mechanisms 
that occur in the early stages of the visual cortex, it was deployed as a sub- 
stitute of the first convolutional layer only [43]. Using the push-pull layer in 
ConvNet architectures achieves better performance on image classification tasks 
when dealing with images that have been corrupted with noise or other types of 
artefacts (e.g. jpeg compression, blur, contrast changes and so on). Furthermore, 
when deploying the push-pull layer in ConvNets instead of a convolutional layer, 
the number of parameters to learn does not increase. 


4 Conclusions 


The research fields of image processing and computer vision were influenced by 
discoveries and progress in the understanding of the functions of neurons in the 
visual system. Computational models of different types of neurons formalized 
by neuro-physiological studies of their responses to visual stimuli have been 
deployed for image processing, especially related to low-level tasks such as line 
and contour detection. 

In this paper, we reviewed the developments of edge and contour detec- 
tion algorithms influenced by progress made in the understanding of the visual 
processes that occur in the visual cortex. We paid large attention to the impor- 
tance that inhibitory mechanisms, namely push-pull inhibition and surround 
suppression, have on the robustness of the processing of visual stimuli in noisy 
and textured scenes. Furthermore, we covered the connections that neuro- 
physiological findings have with the development of Convolutional Networks 
and how inhibitory phenomena were explicitly implemented in the architecture 
of these networks with the aim of improving their stability to varying input 
stimuli. 
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Abstract. Recent research on face analysis has demonstrated the rich- 
ness of information embedded in feature vectors extracted from a deep 
convolutional neural network. Even though deep learning achieved a very 
high performance on several challenging visual tasks, such as determin- 
ing the identity, age, gender and race, it still lacks a well grounded theory 
which allows to properly understand the processes taking place inside the 
network layers. Therefore, most of the underlying processes are unknown 
and not easy to control. On the other hand, the human visual system fol- 
lows a well understood process in analyzing a scene or an object, such as 
a face. The direction of the eye gaze is repeatedly directed, through pur- 
posively planned saccadic movements, towards salient regions to capture 
several details. In this paper we propose to capitalize on the knowledge of 
the saccadic human visual processes to design a system to predict facial 
attributes embedding a biologically-inspired network architecture, the 
HMAX. The architecture is tailored to predict attributes with different 
textural information and conveying different semantic meaning, such as 
attributes related and unrelated to the subjects identity. Salient points 
on the face are extracted from the outputs of the S2 layer of the HMAX 
architecture and fed to a local texture characterization module based on 
LBP (Local Binary Pattern). The resulting feature vector is used to per- 
form a binary classification on a set of pre-defined visual attributes. The 
devised system allows to distill a very informative, yet robust, represen- 
tation of the imaged faces, allowing to obtain high performance but with 
a much simpler architecture as compared to a deep convolutional neural 
network. Several experiments performed on publicly available, challeng- 
ing, large datasets demonstrate the validity of the proposed approach. 


1 Introduction 


In the last decade, soft biometric traits have been widely used for person identifi- 
cation because of the robustness to noise, non-intrusiveness, and privacy preser- 
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vation. In the last years Deep learning approaches have been proposed also to 
extract soft-biometric attributes from face images. However, the high perfor- 
mance achieved are always paired with the requirement for high computational 
power and a large dataset for training. Liu et al. [1] proposed a method based 
on two CNNs which are trained for face localization (LNet) and attributes pre- 
diction (ANet). The top network layer, FC is exploited to learn identity-related 
features, such as the gender and race. Layers C3 and C4 are exploited to extract 
Identity-unrelated attributes, such as the facial expression, wearing hat and sun- 
glasses. Samangouei et al. [2-4] proposed a CNN architecture suitable for mobile 
devices, which is based on the analysis of face parts. Recently, Dhar et al. [5] 
considered the usefulness of the outputs of the internal layers of two deep con- 
volutional networks, Resnet 101 and Inception Resnet v2, for the prediction of 
facial attributes. Izadi [6] proposed the fusion of the extracted facial attributes 
with the face image to perform face recognition on a shared CNN architecture. 
Recently different works [7,8] proved that the final representation computed by 
a deep convolutional neural network embeds information not only about iden- 
tity but also on the head pose and illumination. In this paper we propose to 
extract information from internal layer corresponding to HMAX network with 
the purpose of predicting different facial attributes. The HMAX model, which 
has been developed before deep learning took over in many computer vision 
problems, demonstrated the feasibility of a biologically-inspired neural architec- 
ture for face recognition. The model was tested on several publicly available 
databases such as LFW, PubFig and SURF-W [9] providing results at the state 
of the art. In [10], a new C3EF layer, inspired by the ventral and dorsal streams 
of the visual cortex, has been added to perform view-independent face recogni- 
tion. Hu [11] proposed a version of the HMAX model, named 'sparse HMAX', 
addressing the local-to-global structure of the hierarchy, where the S2 bases are 
learned by sparse coding. In this paper we propose a novel hybrid system based 
on the HMAX network architecture. The outputs of the internal S2 layer are 
used as the seeds for extracting interest regions which are then used to generate 
the feature vector for the classification of the facial attributes. The following 
issues are addressed: 


e How the salient feature points extracted from the HMAX architecture can 
improve the prediction of facial attributes. 

e To which extent the devised system can be applied to predict different kinds 
of facial attributes. 

e What is the robustness of an attentitive visual system to variations in head 
pose, lighting and facial expression. 


2 Prediction of Facial Attributes 


Most of the time celebrities or familiar people are remembered because of their 
special hair style, accessories or even clothes. This daily life concept is exploited 
by soft biometrics, or general visual attributes. These attributes can add signif- 
icant information to face images and are quite robust to image degradation and 
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changes in appearance. For this purpose, the internal $2 layer of the HMAX is 
used to detect the most salient points on the subject’s face. The Linear Binary 
Pattern feature extractor is exploited to build a local description of the image 
texture around the selected points. T'he feature vectors computed for each salient 
point are concatenated to produce a global feature vector to characterize the face 
image. The obtained feature vector is fed to a SVM binary classifier to predict 
several visual attributes. In Fig. 1 the general architecture of the proposed frame- 
work is shown. 


Attributes 
Ca sisi 


Fig. 1. Proposed hybrid system for the prediction of facial attributes. 


2.1 The Hierarchical HMAX Network 


HMAX is an hierarchical system that closely follows the organization of visual 
cortex and builds an increasingly complex and invariant feature representation by 
alternating between a template matching and max pooling [12]. As the network 
structure is fixed, a limited number of training examples is required for learning. 
The computational process is hierarchical and it is also invariant to position, 
scale and view-point. Along the hierarchy, the size of the receptive fields and 
the complexity of their optimal stimuli increases. The model consists of four 
computational layers, where simple ‘S’ units alternate with complex C? units. 

The first layer S1 in the HMAX network consists of a bank of Gabor filters 
applied to the full resolution image. The response to a particular filter G, of 
layer S, at the pixel position (X,Y) is given by: 

The first layer of the HMAX model ‘S1’ consists of a bank of Gabor fil- 
ter, the following steps are implemented in (Fig. 2): The Image (at finest scale) 
is [256 *256* 1]. In each image intensity, four Gabor filters are applied over 
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Fig. 2. General architecture of the HMAX model. 


each pixel position. The result of S1 layer (at finest scale) is [246 * 246 * 4]. The 
response of a patch of pixels X to a particular S1 filter G is given by: 


XiGi 
R(X, Y) = 2,XiGi (1) 
y XX) 
The size of the Gabor filter is 11 x 11 and it is formulated as: 
—(ax?4?2y? Qn 
G(x, y) = exp( E eps 2 x) (2) 


20? 


À 


Where X = xcos0 — ysin0 and Y = xsin0 4- ycos0. x and y vary between —5 and 
5, and 0 varies between 0 and r. The parameters p (aspect ratio), c (effective 
width), and » (wavelengh) are set to 0.3, 4.5 and 5.6, respectively. For the local 
invariance (C1) layer, a local maximum is computed for each orientation. They 
also perform a subsampling by a factor of 5 in both the X and Y directions 
[13]. In the intermediate feature layer (S2 level), the response for each C1 grid 
position is computed. Each feature is tuned to a preferred pattern as stimulus. 
Starting from an image of size 256 x 256 pixels, the final S2 layer is a vector of 
dimension 44 x 44 x 400. The response is obtained using: 


R(X, P) = exp) (3) 


The last layer of the architecture is the Global Invariance layer (C2). The max- 
imum response to each intermediate feature over all (X, Y) positions and all 
scales are calculated. The result is a characteristic vector that will be used for 
classification. For the implementation of the HMAX model we use the tool pro- 
posed in [13] As the final layer (C2) is the features vector that corresponds to 
maximums obtained from each $2 output, which are 400 characteristics. These 
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maximum correspond to a certain locations (best coordinates) that are the max- 
imal responses for each patch and image. These coordinates are accumulated and 
projected into the original faces images and used as interest points. 


2.2 Local Texture Description Based on LBP 


LBP is a type of visual descriptor used for classification in computer vision. The 
idea of the texture extraction using LBP is to give to each pixel a code which 
depends on the gray scale of its neighbors. The gray scale of the central pixel 
(ie) is compared to its neighbors in the following formula: 


P 
0, «<0 
LBP(2¢, Yc) = slin — ie)?” X=< 4 
(26.3) = J slin — ie) K 3 (4) 


n=0 


The LBP code of the current pixel is produced by concatenating the 8 values to 
construct a binary code. The center of each window corresponds to the interest 
points obtained from HMAX. 


2.8 Binary Classification with Support Vector Machines 


The SVMs are groups of learning techniques that are designated to solve prob- 
lems of discrimination, i.e. to decide to which class a pattern belongs, or of 
regression, i.e. to predict the numerical value of a variable. The success of this 
method is defined by solid mathematical bases. The main objective of the SVM 
is like the perceptron principle but it consists not only into finding an hyper 
plan that separates perfectly the classes but also to find the optimal one that 
can separate perfectly the classes by maximizing the margin. They project the 
data in space of characteristics by using non-linear functions. In this space it 
builds the optimal hyper plan which separates the transformed data. The prin- 
cipal idea is to build a linear separation surface in the space of the characteristics 
which corresponds to a non-linear surface in the space entry (Figure 3 presents 
this non-linear transformation). 

The Support Vector Machines approach passes through two steps: The train 
that consists of searching an optimal hyper plan of separation by maximizing 
the margin, with the resolution of a quadratic program and determination of 


Fig. 3. The non linearity. 
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the Lagrange multipliers [14]. The test, after the determination of the Lagrange 
multipliers, it applies the decision function to the test examples to determinate 
the class [14]. The classification is conducted by using the SVM-KM toolbox [15] 
and by considering a Gamma value of 1e-7 and penalty parameter of the error 
term C — 100 while we use Gaussian kernel. 


3 Experimental Results 


Several publicly available large datasets have been used for testing the proposed 
architecture. The CelebFaces Attributes Dataset (CelebA) [1] is a large-scale face 
attributes dataset with more than 200K celebrity images each is notated with 40 
attributes. The images in CelebA dataset include large variations in appearance 
such as pose and background. It contains 10.177 identities having 20 images in 
average. The CelebA does not overlap with LFW dataset identities. We also use 
the LFW dataset [16] which is a large dataset, real-world face dataset consist- 
ing of 13.000 images of faces collected from internet. These images are taken 
in completely uncontrolled situations. This dataset contains variations in pose, 
lighting, expression, camera, imaging conditions. CelebA and LFW were inten- 
sively used in the recent proposed work in the litterature for the aim to predict 
facial attributes. The PubFig dataset [19] has been used to test the sensitivity 
to variations in pose, illumination and facial expression. The PubFig dataset is 
a large, real-world face dataset (including both celebrities and politicians) con- 
sisting of 58797 images of 200 subjects collected from the internet. The PubFig 
dataset is both larger and deeper, on average 300 images per individual than 
previous seen dataset. These images are taken in completely uncontrolled situ- 
ations. This database contains variations in pose, lighting, expression, camera, 
imaging conditions. The PubFig dataset is similar to LFW dataset. However, 
the PubFig dataset has enough examples per each subject. 


Experiment 1: The first experiment consists of the facial attributes prediction 
using the Labeled Faces in the Wild (LFW) and the CelebFaces dataset. Table 1 
and Table2 represent the facial attributes prediction results using CelebA and 
LFW accordingly. In this experiment we propose also to compare our proposed 
system (internal layer) for attributes prediction with the top layers corresponding 
to HMAX, VGG, Alexnet and ResNet-50. Alexnet is a convolutional neural net- 
work that is trained on more than a million images from the ImageNet database 
[19]. The network is 8 layers deep and it can classify the images into 1000 object 
categories [20,21]. The network has an image input size of 227-by-227. ResNet- 
50 is also a CNN framework trained with the images from ImageNet [19]. It is 
50 layers deep with an input size of 224-by-224. But unlike Alexnet, Resnet-50 
layers are organised in residual blocks. With each block encompassing of at-least 
3 CNN layers (1x1; 3x3; and 1x 1 convolutions) followed by a shortcut con- 
nection. VGG is a convolutional neural network that is trained on more than a 
million images from the ImageNet database. The network is 16 layers deep and 
can classify images into 1000 object categories, such as keyboard, mouse, pencil, 
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and many animals. As a result, the network has learned rich feature representa- 
tions for a wide range of images [20]. This experiment is for the aim to compare 
between the proposed system that is based on internal layer and the final layers 
corresponding to the HMAX, VGG, Alexnet and Resnet-50. 


Experiment 2: In the second experiment we propose to compare the obtained 
results on LFW and CelebA faces with some recent obtained results in the state 
of the art such as FaceTracer [17], PANDA [18], LNets--ANet [3] and Shared 
CNN proposed in [8]. Table3 and Table4 represent the results on LFW and 
CelebA datasets respectively. The aim of this experiment is to compare our 
proposed hybrid architecture with recently proposed systems in the litterature 
based on DNNs. 


Experiment 3: In the third experiment, we test the PubFig dataset as it con- 
tains a large number of images per each subject. We study the efficiency of our 
proposed system on pose variation, faces under different illumination and expres- 
sion. As the PubFig dataset contains images from the web a lot of them are not 
available, for this reason we construct a database composed of 54 subjects with 
8942 images. Table 5 represents the obtained results which were compared also 
with Alexnet, VGG and Resnet-50 features. 

From the obtained results (Tablel, Table2), one can see clearly that the 
LFW dataset obtain better results with our hybrid system comparing to the 
CelebA and this is mainly due to the fact that we used CelebA faces in nonuni- 
form order for the identities because the CelebA is composed of thousand of 
identities with twenty images per identity in average, these twenty images are 
dispatched randomly in the data; however, in the LFW database all the images 
corresponding to the same subject are in a successive order. Our proposed system 
demonstrate a comparable efficiency with the pretrained Deep Neural Networks 
(VGG, Alexnet and Resnet-50) in the same time surpass the obtained results 
with the features obtained from the final layer C2 corresponding to HMAX. 
Some facial attributes such as ‘Attractive’, ‘Gray Hair’, ‘Male’, ‘Mustache’ were 
well predicted using our model comparing to the pretrained DCNNs. In addi- 
tion, our proposed biological system shows good performances comparing to the 
proposed models in the state of the art [3,8,17,18] specially on identity facial 
attributes. These identity-based facial attributes such as ‘gender’, ‘hair color’, 
‘nose’ and ‘lips shape’, ‘chubby’ and ‘Blad’ can add meaningful information 
for face identification. Additionally, the proposed hybrid attention-based system 
achieve a comparable results with the final layers of VGG, AlexNet, ResNet-50 
and HMAX. Another advantage is that our model use different locations to pre- 
dict the whole attributes faces. However, LNets-ANets [1] use different layers 
to predict different kind of attributes, also [3] consider the fusion of different 
and specific regions from the face to detect a specific facial attribute. In the 
latest case we may not find all facial regions available specially with different 
poses on mobile devices. Even though the PubFig is very challenging database 
with variation in pose, illumination and expression, our proposed approach can 
distinguish between frontal and dark lighting images. 
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'Table 1. Facial attributes prediction accuracies on LFW. 


Attribute HMAX | VGG | Alexnet | Resnet | Ours 
5 o Clock Shadow 79.5 93 95.5 97.5 95 
Arched Eyebrows 91 98.5 | 98.5 99.5 96 
Attractive 76 95 96 93.5 96.5 
Bags Under Eyes 86 92 95.5 96 91 
Bald 95 99 98 98.5 95 
Bangs 83 96.5 | 97 97.5 90 
Big Lips 66 92.5 | 92.5 89.5 88 
Big Nose 94.5 99.5 |99 99.5 88 
Black hair 99.5 99.5 99.5 99.5 86 
Blond hair 100 100 99.5 100 93 
Blurry 72.5 97 96 97 92 
Brown hair 99.5 100 99.5 99.5 89 
Bushy Eyebrows 84.5 95 96 96 86 
Chubby 96.5 97.5 99.5 99.5 90 
Double Chin 88.5 94.5 | 98.5 99 96 
Eyeglasses 74.5 93 95 95 92 
Goatee 84.5 97.5 | 97.5 97.5 91 
Gray Hair 45.5 32.5 | 27.5 24 91 
Heavy Makeup 99 100 98 99.5 91 
High Cheekbones 72.5 96 96 95 88 
Male 55 56.5 | 57 64 91 
Mouth Slightly Open | 45 46.5 | 46 34.5 91 
Mustache 70 85.5 | 87.5 88 91 
Narrow Eyes 81.5 83 91 88.5 88 
No Beard 85.5 95.5 | 98.5 98.5 91 
Oval face 88 98.5 97.5 97.5 91 
Pale Skin 99 99.5 99.5 100 89 
Pointy Nose 71 93 92 95 90 
Receding Hairline 66.5 89.5 | 91 77.5 90 
Rosy Cheeks 99 99 99 99 91 
Sideburns 80 93.5 | 96.5 96 93 
Smiling 88 96.5 | 96.5 97 94 
Straight Hair 87 95.5 | 98.5 96.5 93 
Wavy hair 75.5 88.5 | 94.5 94.5 89 
Wearing Earings 86.5 96.5 | 96 97 92 
Wearing Hat 68 91 93.5 89.5 88 
Wearing Lipstick 91 99 98.5 97.5 90 
Wearing Necklace 72 93 98 97 88 
Wearing Necktie 99 98 99.5 99.5 93 
Young 100 100 100 100 88 
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'Table 2. Facial attributes prediction accuracies on CelebA. 


Attribute HMAX | VGG | Alexnet | Resnet | Ours 
5 o Clock Shadow 85.4 88.2 88.2 88.2 91.5 
Arched Eyebrows 96.8 77.2 | 77.6 76 60.5 
Attractive 71.2 77.6 | 77.2 75.6 59.5 
Bags Under Eyes 71.8 79.6 | 79.8 78.6 74 
Bald 97.4 97.4 |97.4 97.4 99.5 
Bangs 81.8 87.2 |87 88.8 82.5 
Big Lips 64.6 75.2 | 75.4 74.6 60 
Big Nose 70.4 78 77.4 77 74 
Black hair 69.6 75.6 |75.6 77.6 85 
Blond hair 80.8 86.8 | 84.4 88 80.5 
Blurry 93.8 94 93.6 93.6 99.5 
Brown hair 71.4 78 78.4 77.4 81 
Bushy Eyebrows 78 83.4 | 83.4 83 86 
Chubby 92.6 95.8 | 95.4 95.4 96.5 
Double Chin 94.4 95.6 |95 95.4 97.5 
Eyeglasses 91.4 97.8 | 97.6 96 97.5 
Goatee 91.2 92.4 | 92 92.4 97.5 
Gray Hair 93.6 94.2 | 93.8 94 98 
Heavy Makeup 75.2 82.2 | 80.6 83.8 59.5 
High Cheekbones 62.8 73 68.2 73.8 64.5 
Male 73.6 79.6 | 81 84 83.5 
Mouth Slightly Open | 55.6 62 62.4 66.2 50 
Mustache 94.8 95.6 | 95.6 95.2 96.5 
Narrow Eyes 82.4 89.2 | 89 89.2 87 
No Beard 81.4 84.4 | 83.4 84.6 91 
Oval face 64.4 71.2 71.4 71 94.5 
Pale Skin 95.6 96.8 | 96.2 96.6 98.5 
Pointy Nose 66.6 72.6 | 72.2 71 67 
Receding Hairline 91.2 94.8 | 95 94.6 95.5 
Rosy Cheeks 92.4 94.2 | 94 94.2 85 
Sideburns 92.6 93.6 | 93.2 93.6 96 
Smiling 62.6 71 68.2 75.2 58 
Straight Hair 73.6 81 81.2 81 84 
Wavy hair 66.4 66.8 | 65.8 71.8 64.5 
Wearing Earings 72.4 82 80.6 82.2 57 
Wearing Hat 95 97 96.4 97.4 99 
Wearing Lipstick 75 82.8 | 80.8 85.8 74 
Wearing Necklace 79 86.4 | 85.6 86.2 77 
Wearing Necktie 86.4 91.4 | 91.2 91.4 97.5 
Young 73.6 80 79.6 76.8 81 
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'Table 3. Comparaison of the attribute prediction frameworks on LFW 


Bald | Big Lips | Big Nose | Chubby | Male | Black Hair | Blond Hair 
FaceTracer | 77 68 73 67 84 76 88 
PANDA 84 =| 73 79 69 92 87 94 
LNets-ANet | 88 75 81 73 94 90 97 
Shared CNN |90 |65 75 62 94 80 88 
Our model | 91 88 86 96 91 93 92 


Table 4. Comparaison of the attribute prediction frameworks on CelebA Faces 


Bald | Big Lips | Big Nose | Chubby | Male | Black Hair | Blond Hair 
Facelracer | 89 64 74 86 91 7O 80 
PANDA 96 |67 75 86 97 85 93 
LNets-ANet |98 | 68 78 91 98 88 95 
Shared CNN|97 |67 79 91 99 75 84 
Our model |99.5 | 60 74 96.5 83.5 |85 80.5 


Table 5. Comparison of the attribute prediction on PubFig database 


Pose | Lighting | Expression 
HMAX features | 60 61 50.3 
VGG features 70.40 | 70.7 48.1 
Alexnet features | 70.8 | 69.7 49.7 
Resnet features | 71 69.9 51.5 
Ours 70.9 | 70.9 51.4 


4 Conclusion 


In this paper a visual attention-based system have been proposed to predict 
the facial attributes. This hybrid system allows a biological hierarchical net- 
work ‘HMAX’ to look into a particular salient regions of the input faces in the 
same time reduce the complexity by discarding the irrelevant information. These 
regions were introduced to LBP with the aim of extracting texture feature around 
these interest points extracted with HMAX. This proposed framework shows a 
promising results comparing to Deep Neural Networks architectures. The success 
of the hybrid architecture is due not only to the biological vision perception but 
also to the possibility and flexibility of this approach to learn and treat small 
amount of data and predict different facial attributes. By surpassing these issues 
we can solve the main problem of DCNNs that require large amount of data 
for training. The proposed approach can add a good impact on face recognition 
as it can predict the most challenging real-world scenario (different background, 
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pose variation, illumination, expression) that can degrade significantly the face 
recognition performances. 
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Abstract. The exchange of ideas between computer science and sta- 
tistical physics has advanced the understanding of machine learning 
and inference significantly. This interdisciplinary approach is currently 
regaining momentum due to the revived interest in neural networks and 
deep learning. Methods borrowed from statistical mechanics complement 
other approaches to the theory of computational and statistical learning. 
In this brief review, we outline and illustrate some of the basic concepts. 
We exemplify the role of the statistical physics approach in terms of a 
particularly important contribution: the computation of typical learn- 
ing curves in student teacher scenarios of supervised learning. Two, by 
now classical examples from the literature illustrate the approach: the 
learning of a linearly separable rule by a perceptron with continuous 
and with discrete weights, respectively. We address these prototypical 
problems in terms of the simplifying limit of stochastic training at high 
formal temperature and obtain the corresponding learning curves. 


1 Introduction 


At least two major developments have led to the regained popularity of machine 
learning in general and neural networks in particular [1-6]. Most importantly, the 
ever-increasing availability of training data from various domains and contexts 
have made possible the training of very powerful systems such as deep neural 
networks [4-6]. At the same time, the computational power necessary for the 
data driven adaptation and optimization of such systems, has become available. 

Several concepts that had been developed earlier, some of them even decades 
ago, could be realized and applied successfully in practice only recently. Examples 
and further references can be found in [4-6]. In addition, novel computational 
techniques and important modifications of the considered systems have con- 
tributed to this success. This includes the use of pre-trained networks, sophisti- 
cated regularization techniques, weight sharing in convolutional neural networks, 
or the use of alternative activation functions [4-8]. 

While the relevance and success of the methods are widely recognized, sev- 
eral authors note that the theoretical understanding does not yet parallel the 
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practical advances of the field, see for instance [9-13] in the context of deep 
learning. It is certainly desirable to strengthen and put forward the theoreti- 
cal investigation of machine learning processes in general and deep learning in 
particular. The development of novel concepts and the design and optimization 
of practical training prescriptions would greatly benefit from better theoretical 
understanding. This concerns, for instance, mathematical and statistical foun- 
dations, the dynamics of training, and insights into the expected generalization 
ability of learning systems. 

Concepts borrowed from statistical mechanics have been applied in many 
areas beyond the scope of traditional physics. In particular, analytical and com- 
putational approaches developed for the study of complex physical systems can 
be exploited within computer science and statistics. A prominent example is 
the use of Markov chain Monte Carlo methods [14], which exploit mathematical 
analogies between stochastic optimization and the statistical physics of systems 
with many degrees of freedom. Similarly, analytical methods which had been 
developed for the analysis of disordered systems [15], have been applied in this 
context. 

A somewhat surprising and very inspiring analogy was pointed out by John 
Hopfield [16]: the conceptual similarity of simple dynamical neural networks with 
models of disordered magnetic materials [15]. It attracted considerable interest 
in neural networks and related systems within the physics community. Initially, 
the analysis of thermal equilibrium states in so-called attractor neural networks 
was in the center of interest [1,16,17]. However, the same concepts were applied 
successfully to the investigation of learning and synaptic plasticity in neural 
networks. Elizabeth Gardner's pioneering work [18,19] paved the way for the 
theory of learning in a large variety of machine learning scenarios, including the 
supervised training of feedforward neural networks, methods of unsupervised 
data analysis, and more general inference problems, see [20-22] for reviews. 

A variety of analytical tools and modelling frameworks have been developed 
and applied successfully to, for instance, the study of supervised learning in the 
context of regression and classification tasks. Mostly, relatively simple and shal- 
low feedforward neural networks have been analysed [20-22]. Frequently, these 
training processes are modelled in the frame of a student and teacher scenario. 
'There, a specific neural network, the teacher, is assumed to define the target task, 
e.g. a classification scheme. A student network is trained from a set of exam- 
ples provided by the teacher and parameterizes a data driven hypothesis about 
the target rule. This allows to explicitly control the complexity of the target 
rule and of the learning system in the model. Moreover the performance of the 
trained system can be quantified in terms of its similarity and agreement with 
the teacher network. Learning can be interpreted as the stochastic optimization 
of many degrees of freedem, which motivates possible training algorithms based 
on statistical mechanics ideas. Also, analytical tools for the study of large sys- 
tems in (formal) thermal equilibrium situations can be used, which describe the 
model in terms of a few macroscopic quantities, only. Frequently, these so-called 
order parameters appear naturally when analysing student teacher scenarios. 
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Ultimately, methods developed in the theory of disordered systems allow for 
the investigation of typical properties of the learning system. This concerns, for 
instance, the computation of learning curves as an outcome of the stochastic 
training process on average over the assumed randomness in the example data. 

'The successful applications of these concepts include, among other relevant 
topics, the highly interesting phenomenon of symmetry breaking phase transi- 
tions which result in discontinuous learning curves: Frequently, the success of 
training is found to depend critically on the number of available examples or 
other model parameters [20-25]. Currently, the interest in this type of analysis 
is gaining momentum again in the context of deep learning and other popular 
learning paradigms, see [26-31] for examples. 

In the following section, we briefly outline and illustrate the statistical physics 
of student teacher scenarios in supervised learning. We present two variants of 
a simple and by now classical example: the learning of a linearly separable rule 
with a perceptron network with continuous or discrete weights, respectively. The 
perceptron has been discussed extensively in the literature and serves as a pro- 
totypical system for the understanding of machine learning processes, see e.g. 
[1, 20-22]. For the sake of brevity, we focus on a particularly simplifying approach, 
the consideration of stochastic training in the limit of high (formal) tempera- 
ture. It was introduced and applied to perceptron training in [22]. Despite its 
conceptual simplicity and mathematical ease, this example illustrates the basic 
concepts very well and yields non-trivial results and insights into the learning 
process. 

This contribution is based on a tutorial talk at the Workshop on Brain 
Inspired Computing, BrainComp 2019. It is obviously far from providing a com- 
plete overview of the statistical physics of learning. The intention is to attract 
the reader's attention in terms of selected example applications of the approach 
and to provide references as a starting point for further exploration of this highly 
relevant area of research. 


2 Statistical Physics of Learning: Learning Curves 


Typically, the statistical physics based computation of learning curves in super- 
vised learning proceeds along the following steps: 


1) A student and teacher scenario is defined, which parameterizes the target rule 
and fixes the complexity of the student hypothesis. 

2) It is assumed that training examples and test instances are generated accord- 
ing to a specific input density, while target labels are provided by the teacher 
network. 

3) The study of large systems in the thermodynamic limit allows to describe 
systems in terms of relatively few macroscopic quantities or order parameters. 

4) The outcome of stochastic training processes is interpreted as a formal ther- 
mal equilibrium, in which thermal averages can be considered. 
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5) An additional disorder average over a randomly generated set of training 
data is performed in order to obtain typical results independent of the actual 
training set. 


'The following sections illustrate the above points in the context of learn- 
ing a linearly separable rule [20-22], before two concrete example scenarios are 
analysed in Sect. 2.6. 


2.1 Learning a Linearly Separable Rule: Student and Teacher 


We consider the supervised learning of a linearly separable classification of N- 
dimensional data. In our model, the target rule is defined through a teacher 
perceptron with fixed weight vectors w* € IR" and output 


S*(£) = sign [w* -£] = +1 for any £c RN. (1) 


Here, the feature vector € represents N numerical inputs to the system and 5S* 
corresponds to the correct output. The teacher weight vector parametrizes an 
(N — 1)-dim. hyperplane which separates positive from negative responses. 

We note that the norm |w*| of the weights is irrelevant for the perceptron 
response (1). Throughout the following, we therefore consider normalized teacher 
weights with w* -w* = N. 

In the learning scenario, information about the rule is only available in the 
form of a data set which comprises P examples: 


D = {6", S (E) uds gs (2) 


Here we assume that the labels S*^ = S*(£") provided in D are reliable and 
represent the rule (1) faithfully. We refrain from considering corruption by dif- 
ferent forms of noise, for simplicity, and refer the reader to the literature for the 
corresponding extensions of the analysis [20,21]. 

A second simple perceptron serves as the student network in our model. Its 
adaptive weights w € R parameterize a linearly separable function 


S(£) = sign [w - £]. (3) 


'The weight vector w is chosen in a data-driven training process which is based 
on the available data D and corresponds to the student hypothesis about the 
unknown target. As a consequence of the invariance 


sign | (Aw) - &] = sign| w - €] for arbitrary \ > 0 


we will also consider normalized student weights with w-w = N in the following. 


2.2 The Density of Input Data 


In realistic learning situations it is expected that the density of input features 
is correlated with the actual task to a certain extent. In real world classifica- 
tion problems, for instance, one would expect a more or less pronounced cluster 
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structure which reflects the class memberships already. Clustered or more gen- 
erally structured input densities have been considered in the statistical physics 
literature, see [26] for a recent discussion and further references. Here, however, 
we follow the most frequent approach and resort to the simplifying assumption 
of an isotropic input density which generates input vectors independently. In 
a sense, this constitutes a worst case in which the only information about the 
target rule is contained in the assigned training labels S*(€), while no gap or 
region of low density in feature space marks the class boundaries. 

Specifically, we assume that components of example vectors £^ in D consist 
of independent, identically distributed (i.i.d.) random quantities with means and 
covariances 


with the Kronecker symbol 9,,,, 21 if mz n and dmm=0. 


2.3 Generalization Error and the Perceptron Order Parameter 


The performance of a given weight vector w in the student teacher model can 
be evaluated with respect to a test input £ ¢ D. If we assume that the test input 
follows the same statistics as the training examples, i.e. 


(6j) = 0,  (&j&x) = dyn, (5) 
we can define the so-called generalization error as the expectation value 
" - " lifSz.S* 
ejr") = (e(S(E,8"(6))) where «($,5) a (6 


serves as a binary error measure. Hence, the generalization error quantifies the 
probability for disagreement between student and teacher for a random input 
vector. It is instructive to work out ej explicitly under the assumption of i.i.d. 
inputs. To this end, we consider the arguments of the threshold function in 
student and teacher perceptron: 


r—w-E/VN and a* =w*-€/VN. 


Assuming that the random input vector € satisfies Eq. (5), x and x* corre- 
spond to sums of N random quantities. By means of the Central Limit Theorem 
(CLT) there density is given by a two-dim. Gaussian, which is fully specified by 
first and second moments. These can be obtained immediately as 


(=w= (%)= Ypww GG) = =a, (ny) S =1 


and (r2*)= gd (Eib) = E =R, (7) 


where we have exploited the nc of weight vectors. The covariance 
(xx*) is given by the scalar product of student and teacher weights. The moments 
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(7) fully specify the two-dimensional normal density P(x, x*) and we obtain the 
generalization error as the probability of observing xa* < 0: 


€eg(w, w^) = n EA f] P(z,a*)dxdz* = L arccos(R). (8) 


This result can be obtained immediately by an intuitive argument: The probabil- 
ity for a random vector € to fall into the hypersegments between the hyperplanes 
defined by w and w* is directly given by Z (w, w*) /m which corresponds to the 
right hand side of Eq. (8). 

In the following, the overlap R = w-w*/N plays the role of an order param- 
eter. This macroscopic quantity summarizes essential properties of the N micro- 
scopic degrees of freedom, i.e. the adaptive student weights wj. It is also the 
central quantity in the following analysis of the training outcome. 


2.4 "Training as a Stochastic Process and Thermal Equilibrium 


'The outcome of any practical training process will clearly depend on the actual 
choice of an algorithm and its parameters that is used to infer a suitable weight 
vector w from a given data set D. Generically, the training process is guided by a 
cost function, such as the quadratic deviation of the student output from the tar- 
get in regression systems or the number of incorrect responses in a classification 
problem. 

Frequently, gradient based methods can be used for the optimization of con- 
tinuous weights w € IR^, often incorporating some form of noise as in the popular 
stochastic gradient descent. The search for optimal weights in a discrete space 
with, e.g., w € {—1, +1} could be performed by means of a Metropolis Monte 
Carlo method, as an example. 

'The degree to which the system is forced to approach the actual minimum of 
the cost function is controlled implicitly or explicitly in the training algorithm. 
Example control parameters are the learning rate in gradient descent or the 
temperature parameter in Metropolis like schemes. In the statistical physics 
approach to learning, this concept is taken into account by considering a formal 
thermal equilibrium situation as outlined below. 

In the context of the perceptron student teacher scenario we consider a cost 
function of the form 

P 
H(w) = »» e(S", S") with S" —sign|w-£"], S** =sign[w*-€"]. (9) 


n 


With the binary error measure of Eq. (6), the cost function represents the number 
of disagreements between student and teacher for a given data set. 

Without referring to a particular training prescription we can describe the 
outcome of suitable stochastic procedures in terms of a Gibbs-Boltzmann density 


of weight vectors 
—BH(w) 
P.q(w) = —— with Z — J du(w) e-88 GO. (10) 
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It describes a canonical ensemble of trained networks in thermal equilibrium at 
formal inverse temperature 3 = 1/T. The cost function E(w) plays the role of 
the energy of state w and the normalization Z is known as the partition function. 
The measure du(w) is implicitly understood to incorporate restrictions of the 
N-dimensional integration such as the normalization w? — N. Similarly, Z can 
be written as a sum over all possible weight configurations for systems with 
w € (21,41). 

In the limit 9 — oo, T — 0, only the groundstate with minimal energy can be 
observed in the ensemble, as any other state will have an exponentially smaller 
Peq. On the contrary, for 6 — 0,T' — oo, the energy becomes irrelevant and 
every state Peg can occur with the same probability. In general, the parameter 
B controls the mean energy of the system which can be written as a thermal 
average of the form 


e- 8H w) 
(H); = J so teo z = AA (11) 


Quite generally, thermal averages can be written as appropriate derivatives of 
the so-called free energy F = E In Z, which is also in the center of the following 
analysis. Introducing the microcanonical entropy S(E) we can rewrite 


Z= J dEe PF+S() where S(E) = In / du(w) ó[H(w) — E] (12) 


with the Dirac delta-function ó[...]. For large systems in the thermodynamic 
limit N — oo we assume that entropy and energy are extensive, i.e. that S = N s 
and E = Ne with e, s = O(1). A saddle-point integration yields 


Jim (-InZ/N) = BF/N = be- s(e) (13) 
where the right hand side GF/N has to be evaluated in its minimum with respect 
to e for a given jf. 


2.5 Disorder Average and High-Temperature Limit 


'The consideration of a formal thermal equilibrium in the previous section refers 
to a particular data set D, since the energy function H (w) is defined with respect 
to the given example data. In order to obtain typical results independent of 
the particularities of a specific data set, an additional average over randomly 
generated D has to be performed. 

In the simplest case, we consider data sets which comprise P independent 
vectors £^ with i.i.d. components that obey (4). Hence the corresponding density 
factorizes over examples u = 1,2,... P and components j = 1,2,... N of the 
feature vectors in ID. 

The randomness in D can be interpreted as an external disorder which deter- 
mines the actual energy function H(w) and the corresponding thermal equilib- 
rium. In addition to the thermal average discussed in the previous section, the 
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associated quenched average is denoted as (...)p. Quantities of interest have to 
be studied in terms of appropriate averages of the form ((...)g)p which can be 
derived from the quenched free energy 


Um — (n Z)p 7n. 


The computation of (In Z); is, in general, quite involved and requires the appli- 
cation of sophisticated methods such as the replica trick [1,15, 20-22]. 

We refrain from discussing the conceptual difficulties and mathematical sub- 
tleties of the replica approach. Instead we resort to a very much simplifying limit, 
which has been presented and discussed in [22]. In the extreme setting of learning 
at high formal temperature with @ — 0, the so-called annealed approximation 


(In Z)p = In (Z)g 


becomes exact and can be exploited to obtain the typical training outcome [20- 
22]. Note that in this limit also 


(H(w))p = M (as, s») = Pe, )n. (14) 


Here we make use of the fact that the i.i.d. random examples in D contribute 
the same average error which is given by eg. It is expressed as a function of the 
order parameter R in Eq. (8). We can now perform a saddle point integration in 
analogy to Eqs. (12, 13) to obtain 


: BP 
Jim C In(Zis/N) = 8(F)o/N = 5e (R) — s(R). (15) 
Again, the right hand side has to be evaluated in its minimum, now with respect 
to the characteristic order parameter R of the system. The entropy term 


s(R) = HESS -w*— NR] (16) 


can be obtained analytically by an additional saddle point integration making use 
of the integral representation of the ó-function [1, 20-22]. Since s(R) depends on 
potential constraints on the weight vectors as represented by du(w), we postpone 
the computation to the following sections. 

In order to obtain meaningful results from the minimization with respect to 
R in Eq. (15), we have to assume that the number of examples P scales like 


P=aN/6 with a = O(1). (17) 


Obviously, P should be proportional to the number N of adaptive weights in 
the system, which is consistent with an extensive energy. In addition, P has to 
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Fig. 1. The quenched free energy 8f as a function of the order parameter R in the 
training scenario with spherical student and teacher perceptron, cf. Sect. 2.6. From left 
to right, the rescaled numbers of examples are a = 1.0,3.0 and 5.0, respectively. 


grow like 97! in the high temperature limit. The weak role of the energy in 
this limit has to be compensated for by an increased number of example data. 
In layman's terms: “Almost nothing is learned from infinitely many examples”. 
'This also makes plausible the identification of the energy with the generalization 
error. The space of possible input vectors is sampled so well that training set 
performance and generalization behavior become indistinguishable. 

Finally, the quenched free energy per weight, f = (F)p/NN of the perceptron 
model in the high temperature limit has the form 


Bf = oeg(R) — s(R), (18) 


where a plays the role of an effective temperature parameter, which couples the 
number of examples and the formal temperature of the training process. These 
quantities cannot be varied independently within the simplifying limit 9 — 0 in 
combination with P/N x (9-1. 


2.6 Two Concrete Examples 


Despite the significant simplifications and scaling assumptions, it is possible to 
obtain non-trivial, interesting results also in the high temperature limit. Very 
often, more sophisticated approaches, such as the replica method or the annealed 
approximation for finite training temperatures, confirm the results for 8 — 0 
qualitatively. Therefore, the simplified treatment has often been used to obtain 
first, useful insights into the qualitative properties of various learning scenarios. 
In this brief review, we restrict the discussion to two well-known results for 
simple model situations. Both concern the training of a simple perceptron in a 
student teacher scenario. Originally the models were treated in [22] and they have 
been revisited in several reviews, for instance, [20,21]. We reproduce the results 
here as particularly illustrative examples for the statistical physics approach to 
learning. 
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Fig. 2. Typical learning curves of the perceptron with continuous weights in the student 
teacher scenario, see Sect. 2.6. The left panel shows R(a), the right panel displays the 
corresponding generalization error eg(a). 


The Perceptron with Continuous Weights 
Here we consider a student teacher scenario where the student weight vector 
w € R is normalized (w? = N) but otherwise unrestricted. 

The generalization error as a function of the student teacher overlap R is 
given in Eq. (8). The corresponding entropy, Eq. (16), can be obtained by means 
of a saddle point integration. Alternatively, one can interpret e" s as the volume 
of an (NV — 1)-dimensional hypersphere in weight space with radius v1 — R?, see 
[21] for the geometrical argument. One obtains 


1 
s(R) — 3 In(1— R?) + const., (19) 


where the additive constant does not depend on R. Apart from such irrelevant 
terms, we obtain the quenched free energy in the limit 8 — 0 as 


1 1 
Bf = a—arccos R — 5 In(1 - R’). (20) 
T 


In absence of training data, a = 0, the maximum of the entropy term in R = 0 
governs the behavior of the system. In the high-dimensional feature space, the 
student weight vector is expected to be orthogonal to the unknown w*. 

The free energy is displayed in Fig. 1 for three different values of a. As a 
is increased, we observe that the minimum of Of is found in larger, positive 
values of R, reflecting the knowledge about the rule as inferred from the set of 
examples. 

The student teacher overlap R(o) that corresponds to the minimum of f is 
displayed in Fig. 2 (left panel). In this simple case, it can be obtained analytically 
from the necessary condition for the presence of a minimum: 


obf _ = a 


By means of Eq. (8) this result translates into a learning curve €,(a@), which is 
shown in the right panel of Fig. 2. One can show that large training sets facilitate 
perfect generalization with 
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Fig. 3. The quenched free energy Bf as a function of the order parameter R in the 
training scenario with Ising student and teacher perceptron, cf. Sect. 2.6. In the leftmost 
panel the rescaled numbers of examples is a = 1.5 < o.c, where R = 1 constitutes a 
local minimum while a state with 0 < R < 1 is thermodynamically stable. In the center 
panel with o = 1.8 > o, here perfect generalization with R = 1 corresponds to the 
global minimum. The rightmost panel displays Gf for a = 2.2 > aq where R = 1 
constitutes its only minimum. 


2 


1 
and egla) ~ — for a— oo. (22) 
a 
It is interesting to note that the basic asymptotic a-dependences are recov- 
ered in the more sophisticated application of the annealed approximation or the 
replica formalism [22]. Obviously, an explicit temperature dependence and the 
correct prefactors cannot be obtained in the simplifying limit. 


The Perceptron with Discrete Weights 
As an interesting exercise we also revisit the model with discrete student weights 
[22]. The term Ising perceptron has been coined for the model with weights 
w € (-1,1)" [21,22]. Note that the assumed normalization w? = N is trivially 
satisfied. Moreover, the generalization error is also given by Eq. (8) since its 
derivation does not depend on details of the weight space. 

The corresponding entropy can be obtained by a simple counting argument: 
In order to obtain an overlap )), wjw; = NR, a number of N(R + 1)/2 com- 
ponents must satisfy wj = w; while for N(R — 1)/2 we have wj = —w;j. The 
associated entropy of mixing is given by the familiar form 


sm = - (5) (5) EE (23) 


The resulting free energy (18) as a function of R is displayed in Fig. 3 for three 
different values of a. 

For alla > 0, 8f displays a local minimum in R = 1 with f(R = 1) = 0. For 
small a, however, a deeper minimum can be found with an overlap 0 « R « 1. 
This is exemplified for œ = 1.5 in the leftmost panel of Fig. 3. The global mininum 
of Gf determines the thermodynamically stable state of the system. 
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Fig. 4. Learning of the Ising perceptron with discrete weights in the student teacher sce- 
nario, see Sect. 2.6. The left panel shows R(o), the right panel displays the correspond- 
ing generalization error eg(a). States corresponding to local minima of 8f are marked 
by dashed lines, while solid lines mark the thermodynamically stable global minima. 
Vertical dotted and solid lines correspond to the critical œe © 1.69 and aa e 2.08, 
respectively. 


For training sets with a larger than a critical value a, ~ 1.69, the state with 
R = 1 constitutes the global minimum. A competing configuration with R < 1 
persists as a local minimum, but becomes unstable for o > ag 7 2.08, see the 
center and rightmost panel of Fig. 3. 

The learning curves R(a) and ej(a) reflect the specific a-dependence of 6f 
in terms of a discontinuous phase transition. In Fig. 4, the solid lines mark the 
thermodynamically stable state in terms of R(a) (left panel) and e4(o) (right 
panel). Dashed lines correspond to local minima of Gf and the characteristic 
values a, and o4 are marked by the dotted and solid vertical lines, respectively. 

'The essential findings of the high temperature treatment do carry over to 
the training at lower formal temperatures, qualitatively [21,22]. Most notably, 
the system displays a freezing transition to perfect generalization. Furthermore, 
the first order phase transition scenario will have non-trivial effects in practical 
training. The existence of metastable, poorly generalizing states can delay the 
success of training significantly. Related hysteresis effects with varying o have 
been observed in Monte Carlo simulations of the training process, see [21] and 
references therein. 


3 Summary and Conclusion 


'This brief review merely discusses one goal of the statistical physics of learning: 
the computation of typical learning curves in clear-cut model scenarios. This 
type of results provide basic insight into relevant mechanisms and phenomena 
which play a role in practical machine learning setups as well. The framework 
provides a workshop in which to analyse, put forward and optimize training 
algorithms. Moreover it offers the possibility to systematically compare different 
adaptive systems, network architectures etc. 

The classical examples discussed in this short tutorial concern merely the 
simplest models, i.e. the learning of a linearly separable rule with a percetpron 
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network. The presentation is furthermore restricted to the particularly simplify- 
ing limit of training at high temperature. 

In the literature, numerous studies of more complex adaptive systems, such 
as layered neural networks or support vector machines can be found. Similarly, 
models of unsupervised learning and related problems of data analysis and infer- 
ence have been analysed. Among the many interesting extensions, we mention 
only the study of symmetry breaking phase transitions in feedforward layered 
neural networks. 

The analysis of more realistic training at low formal temperatures requires 
a much more involved mathematical treatment. A thorough discussion thereof 
would be clearly beyond the scope of this brief introduction to the field. Indeed, 
the theory of learning has had a very fruitful impact on the development and 
understanding of sophisticated methods for the analysis of disordered systems 
in general. 

Apart from the equilibrium approach discussed here, statistical physics also 
provides the tools to analyse non-equilibrium situations. This has helped to study 
the dynamics of learning in a very similar fashion. The resulting insights directly 
link to popular practical training prescriptions such as the popular stochastic 
gradient descent. 

Recently, the statistical physics of learning is being rediscovered and has 
gained popularity again in the context of deep learning. A better theoretical 
understanding of this successful machine learning framework is highly desirable. 
Currently, many researchers revisit the statistical physics perspective to learning, 
aiming a fundamental insights into design and performance of, for instance, deep 
layered networks. A brief discussion of recent developments, challenges and open 
questions, as well as further references can be found in [32]. 

The author is convinced that the revival of the area will contribute signif- 
icantly to the further development of machine learning and data analysis in 
general. 
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Abstract. In the last decade, Sentiment Analysis and Affective Com- 
puting have found applications in different domains. In particular, the 
interest of extracting emotions in healthcare is demonstrated by the vari- 
ous applications which encompass patient monitoring and adverse events 
prediction. Thanks to the availability of large datasets, most of which are 
extracted from social media platforms, several techniques for extracting 
emotion and opinion from different modalities have been proposed, using 
both unimodal and multimodal approaches. After introducing the basic 
concepts related to emotion theories, mainly borrowed from social sci- 
ences, the present work reviews three basic modalities used in emotion 
recognition, i.e. textual, audio and video, presenting for each of these 
i) some basic methodologies, ii) some among the widely used datasets 
for the training of supervised algorithms and iii) briefly discussing some 
deep Learning architectures. Furthermore, the paper outlines the chal- 
lenges and existing resources to perform a multimodal emotion recog- 
nition which may improve performances by combining at least two uni- 
modal approaches. architecture to perform multimodal emotion recogni- 
tion. 


Keywords: Sentiment analysis - Affective computing - Text mining - 
Data mining * Neuroscience - Hardware accelerators 


1 Introduction 


Affect is a term widely used in psychology to describe the psychophysiological 
response to stimuli, as for example emotions, while the word sentiment refers to 
a more organized and highly socialized feeling, which can be summarized with 
the expression “an opinion colored by emotion" [1]. 

Several are the studies showing how affective phenomena may influence dif- 
ferent human cognitive processes such as the mechanisms of attention and infor- 
mation processing, as well as the processes of judgment and decision making [2] 
and communication. On the other hand, it is well known how many pathological 
conditions, as for example mood disorders, are characterized by a distorted and 
inconsistent emotional state [3,4]. 
© The Author(s) 2021 
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In humans the ability to be aware, to express and to recognize their own 
and other's emotions through communicative processes is more developed than 
in other animals. This outstanding capacity reflects the additional hierarchical 
levels of processing. Levels that allow learning, inference and simulation [5] and 
which may constitute a crucial aspect in emulating human intelligence [6]. The 
development of computational models capable of mimic the natural ability to 
identify emotional states from speech, facial expressions and written messages is 
a challenging topic which is gaining particular interest in recent years. 'The main 
areas for the computational study related to the recognition of emotions are 
Affective Computing and Sentiment Analysis. Historically, Affective Computing 
relates to the area of Artificial Intelligence (AT) that aims at the development of 
systems capable of recognizing, interpreting and simulating emotions understood 
in the meaning of affect, and therefore the detection of affective human states is 
more focused on different biosignals, while Sentiment Analysis aims at extracting 
opinions and emotions in the sense of sentiment, and was mainly focused on 
textual sources. 

As a result of the popularity of social platforms, the availability of heteroge- 
neous content related to opinions and emotions is increasingly growing, offering 
the possibility to collect and merge the multimodal information for knowledge 
extraction. Since communication among human is a mix of verbal and nonverbal 
content, a system able to measure the emotional state of a person would take 
advantage from a multimodal approach. For this reason, lately different efforts 
in performing multimodal emotion recognition have been made. 

Even though significant AI advancements, the topic continues to pose numer- 
ous open challenges both in the field of research and in large industrial sectors, 
due to the significant impact on marketing strategies [7], recommender systems 
[8] and, recently, also in the medical and psychological field for the development 
of diagnostic and therapeutic clinical decision support systems [9,10]. 

The present work aims to provide a general overview of the technologies and 
methodologies involved in recognition of emotional states and to give an insight 
into the opportunity and challenges for developing an emotion recognition system 
that merges different modalities. The paper is organized as follows: in Sect. 2 the 
fundamental and most accepted approaches for the scientific study of emotions 
in psychological, cognitive and social science are presented. Section3 is devoted 
to discuss the basic unimodal approaches used in emotion recognition, and also 
to present some existing datasets and used tools. Section 4 discusses the use of 
Deep Learning (DL) in emotion recognition. In Sect.5 the opportunities and 
challenges which arises in developing a multimodal emotion recognition system 
are outlined. Finally, Sect. 6 concludes the paper. 


2 Emotion Theories 


'The difficulty of defining emotions in scientific terms is a well-known problem in 
the history of psychology. Currently, there is no scientific consensus on the defini- 
tion of emotion, but there are several heuristic theories which can be grouped into 
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two different viewpoints: the “basic emotions" and the “appraisal” approaches[2]. 
Despite the variability of the emotional responses, which may be linked to sub- 
jective, cognitive and cultural differences, basic emotions models presuppose the 
existence of universal and easily recognizable psychophysiological states. 

The main assumption underlying the basic emotions approach [11,12], is that 
emotions belonging to the same class may vary in intensity or other dimensions, 
but share not only comparable causes and stimuli of responses but also have a 
biological analogy, as for example similar behavioral patterns, bodily activations 
and facial expressions. This strands tends to identify this set of similar emo- 
tions with the most prototypical one (for example “joy” contains “happiness”, 
“enjoyment”, “pleasure”, “joyfulness”, “ecstasy”, “thrill” etc.) [13]. 

On the other hand, appraisal approaches [14-16] assume that emotions are 
triggered by a physiological response which, unlike the basic emotions approach, 
derives from an interpretation of the specific situation through personal criteria 
[13]. With respect to the identification and classification of emotions, there is a 
tendency to differentiate between discrete and dimensional theories of emotions. 
A short description of these models will be given below. 


2.1 Discrete Theories of Emotions 


'The models of discrete emotions are generally the most used in the field of Affec- 
tive Computing, especially as regards the recognition of emotions from facial 
expressions. Numerous models of discrete basic human emotions exist, differ- 
ing among each other in the number and the type of identified emotions. Some 
widespread criteria to identify basic emotions from non-basic ones are: 1) a uni- 
versally recognizable facial expression, 2) a rapid spontaneous and automatic 
recognition, 3) a unique feeling [11]. To date, the most accredited models of 
discrete emotions in the scientific community refer to the theories of Arnold, 
Ekman, Izard, Oatley and Johnson-Laird, Plutchik and Tomkins. Table1 sum- 
marizes the list of basic emotions for each of these theories. 


Table 1. Emotions definitions. 


Categorical emotion theory Identified emotions 

Arnold [17] Anger, aversion, courage, dejection, desire, 
despair, fear, hate, hope, love, sadness 

Ekman [11] Anger, disgust, fear, joy, sadness, surprise 

Izard [12] Anger, contempt, disgust, distress, fear, guilt, 


interest, joy, shame, surprise 


Oatley and Johnson-Laird [18] | Anger, disgust, anxiety, happiness, sadness 


Plutchik [19] Acceptance, anger, anticipation, disgust, joy, 
fear, sadness, surprise 


Tomkins [20] Anger, interest, contempt, disgust, distress, fear, 
joy, shame, surprise 
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Among the theories summarized in the Table 1, the one proposed by Oatley 
and Johnson-Laird [18] has the difference that the existence of basic emotions 
not only has biological foundations but also a semantic component. The authors 
address basic emotions as “semantic primitives”, which means that humans know 
they are feeling a particular emotion X but they don't know how to define it. 

In [21], an interesting work related to the evaluation of emotion theories for 
computational purposes is carried out. In particular, starting from a corpus of 
over 21,000 tweets, six basic emotions theories were analyzed through an iterative 
clustering algorithm based on a variant of Latent Semantic Analysis to discern 
which one has the most semantically distinct set of emotions. Results showed 
that Ekman's model, which is the most popular in Affective Computing, is the 
one in which emotions are more semantically distinct. Then, Bann et al. [21] also 
considered 21 emotions given by joining all the six different models and extracted 
the optimal semantically separated basic emotion set which was proposed as a 
new model of basic emotions consisting of eight emotions: Accepting, Ashamed, 
Contempt, Interested, Joyful, Pleased, Sleepy, Stressed. 


2.2 Dimensional Emotional Models 


Although the most accredited paradigm in the field of neuroscience research states 
that emotions can be divided into discrete and independent categories, numerous 
dimensional models were proposed in the literature. In dimensional theories take 
as and assumption that all the affective states derive from common neurophysio- 
logical systems. Consequently, every emotion can be expressed as a combination 
of these systems, also addressed as dimensions. Only a few are the dimensional 
theories widely accepted by the scientific community, described below. 

The complex model of emotions of Russell [22] is one of the first and most 
well-known dimensional models. The complex model identifies two independent 
dimensions: valence and arousal, represented as the two dimensions of a plane in 
which 28 emotions are mapped in a circle. The arousal-nonaraousal scale mea- 
sures the intensity of emotion and constitutes the vertical axis of the represen- 
tation system. The points belonging to the upper semicircle are characterized 
by high arousal, while the points belonging to the inferior semicircumference 
are characterized by low arousal. Valence is measured by a pleasure-displeasure 
scale, which measures the pleasure of an emotion. On the left semicircumfer- 
ence, unpleasant emotions are represented, while on the right semi-circumference 
pleasant emotions are shown [19]. 

Furthermore there are hybrid models, for example the Plutchik model is an 
example of a model that merges the categorical and dimensional approaches. In 
fact the Plutchik basic emotions represented in Table 1 only refer to the primary 
Plutchik’s emotions. In fact, in his model, named “Wheel of emotions", affective 
states are represented in a structural and circocentric way. The proximity to 
the center represents a greater intensity, while the eight dimensions identified 
are visually represented as eight sectors inscribed in the circocentric structure 
and arranged as four pairs of opposites: Joy-Sadness, Fear-Anger, Anticipation- 
Surprise, Disgust- Trust. 
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3 Basic Unimodal Emotion Recognition Approaches 


'The focus of this Section is to present the three most used modalities for the 
recognition of emotions: text, images and audio. For each of the modality, the 
main tasks, the existing approaches, the available datasets and the function- 
alities of some existing tools in the literature will be presented. We term those 
methods as “unimodal” because each one uses only a type of input data to detect 
emotions. 


3.1 Emotion Recognition from Textual Sources 


As stated in Sect. 1, in general extracting opinions and emotions from textual 
content refers to the area known as Sentiment Analysis which may be seen as 
the intersection of statistical methods, Machine Learning, Information Retrieval 
and Natural Language Processing. 

In the foundational works of Liu, a computational definition of emotion end 
sentiment is presented [23] and to date it is widely accepted among the Sentiment 
Analysis research community. 

According to his definition, an opinion is a quintuple (entity, entity's feature, 
sentiment, opinion holder, time) and the most basic Sentiment Analysis task is 
polarity detection, whose main goal is to detect whether a text unit contains a 
positive, negative, or neutral opinion, and/or also considering a “valence” score, 
indicating how strongly T is positive or negative. Valence scores can be expressed 
as a nominal (strong negative, weak negative, weak positive, strong positive), or 
as a continuous variable, frequently belonging to a specific range (for example 
[-1, 1]). Similarly, an emotion can be seen as a quintuple in which sentiment is 
replaced with an emotion type. More in details, a categorical emotion type may 
be expressed as the couple (emotion class, emotion intensity), where: 


e the emotion class indicates the class to which the specific emotion belongs 
w.r.t. a given system of basic emotions representation; 

e the emotion intensity represents a “valence” score, indicating how strongly 
the emotion is expressed in the given text unit. 


When considering a discrete theory of emotions, a basic emotion detection 
task in SA can be seen as the multiclass classification problem that, given an 
input text T and a list E = |E,- +- , Ex] of basic emotions classes, aims at 
detecting whether the text T contains one emotion £; for i = 1---k and, even- 
tually extracting the respective valence score v;. However, as will be shown in the 
following Subsection, existing annotated emotion datasets do not contain only 
basic emotions and therefore the classification problem usually is designed as a 
multiclass and multilabel problem. If instead a dimensional theory of emotions is 
taken into account, a basic emotion detection task may be seen as the regression 
problem of assigning valence or arousal values to a text T', on the basis of its 
content. 
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A SA process generally follows a standard Text Mining process. Input data 
are extracted from social media, or other sources converted in plain text for- 
mat and pre-processed. Pre-processing methods include standard NLP and 
text mining techniques such as stemming, tokenization, part of speech tagging, 
entity extraction and relation extraction. For an online text, specific data pre- 
processing methods include cleaning, like removing URLs, HTML tags, abbrevi- 
ation expansion, emoji, and repeated characters handling. The intermediate step 
of a Sentiment Analysis process is represented by the analysis module that can 
follow three types of approaches, that is, supervised, unsupervised and hybrid 
approach and it is based on three possible levels of analysis, i.e., document-based, 
sentence-based and aspect-based [24-26]. 

In lexicon based approaches, the starting point is a set of words in which 
for every term a given emotion or relative polarity is associated, with or with- 
out a score. This set of words can be manually expanded through the use of 
synonyms or antonyms following a dictionary-based approach. A major issue 
of lexicon based approaches is to not take into account the specific application 
domain, with a resulting low text-contextualization capability. Also statistical 
and semantic methods have been used to enrich the set of annotated words, fol- 
lowing a so called corpus-based approach. Despite having the advantage that the 
performance does not depend on the size of the dataset, as in machine learning 
approaches, a significant drawback in lexicon-based approach is that it is not 
suitable for the rapid change to which the language of the web is subject. To 
exploit the advantages of the two previous approaches, hybrid methodologies 
that combine machine learning techniques with lexicon-based approaches have 
been developed. 


3.1.1 Available Textual Datasets 

To identify the polarity/emotions of an input text, supervised methods need a 
set of annotated data on which the model is trained. One of the challenges in 
sentiment analysis is that polarity and emotions expressed in the text strongly 
depend on used language, context and domain. 

Most of the datasets used within the SA are manually annotated. One of the 
problems related to manual annotation is that the human evaluation of “senti- 
ment” is strongly conditioned by personal experiences, thoughts and beliefs. It 
is estimated that different people who read a text agree on the generic “senti- 
ment” contained in it only in the 60-65% of cases on average. It is therefore clear 
how difficult it might be to obtain high quality datasets, reaching high values of 
^nter-annotation agreement" levels, especially in the case of recognition of emo- 
tions. However, the classification of polarity is the textual categorization task 
that currently holds the largest number of well-noted datasets. For this reason, 
in this section most of the datasets reported are mainly annotated to perform 
polarity detection. Few are the datasets that can be used as a benchmark for 
the evaluation of the performance of the classification algorithms. 
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e IMDb is a dataset of movie reviews collected from "Internet Movie Database" 
(IMDb). There are several versions of the dataset that have been collected 
and annotated. The most used versions are those noted at the document level. 
In particular, in the version of Pang and Lee [27], known as *Movie Review 
Dataset", there are 2000 reviews, 1000 categorized as positive and 1000 as 
negative. A second dataset (called “IMDb dataset") was annotated by Maas 
et al. [28] and consists of 50,000 movie reviews. Positive polarity is associated 
with the text of the review if the movie has been evaluated with more than 
six stars and negative polarity otherwise. 

e Stanford Twitter Sentiment (STS), also known as Sentiment 140!, is an anno- 
tated dataset introduced by Go et al. [29]. The training set contains 1.6 million 
tweets containing emoticons. The annotation was performed automatically by 
assigning a positive label to the tweet containing positive emoticons :), :-),:),: 
D, or —) and negative if it contains negative emoticons :(, :-(, o: (. However, 
since the emoticons may not reflect the actual sentiment of the tweet, the 
dataset has been extensively used for subjectivity classification tasks as well 
as a dataset for sentiment analysis. 

e Sentiment Strength Twitter Dataset (SS-Tweet). Proposed by Thelwall et al. 
[30] for the evaluation of the SentiStrength tool , the dataset contains 4242 
manually recorded tweets. Unlike the datasets described so far, the annotation 
is ordinal, in a range —5 (extremely negative) to 5 (extremely positive). 

e SemEval Datasets: SemEval (Semantic Evaluation) is a series of computa- 
tional competitions of semantic analysis systems taking place annually. The 
sentiment analysis task was introduced for the first time in SemEval-2013. 
The dataset has been annotated with the use of Amazon Mechanical Turk?, 
for a total of 15196 tweets annotated for SA task at document and aspect 
level. The datasets from SemEval-2014 to SemEval-2016 are extensions of 
SemEval2013. In SemEval2016 the dataset has been extended to include other 
tasks, such as the quantification of tweets with the aim of estimating the dis- 
tribution of tweets between classes compared to individual tweets. Finally, 
the SemEval-2017 and SemEval-2018 competitions were more focused on the 
categorization of affect in the Tweets. They are emotional datasets, scoring 
for a single emotion, rating for a single emotion, classification among 9 emo- 
tions and in addition the neutral class, scoring and valence rating (agreement 
of terms of positive or negative sentiment) [31]. 


3.2 Affective Computing Methodologies 


A generic Affective Computing process starts with one or more biosignal acquisi- 
tion. Typically these include measures related to physical aspects and physiolog- 
ical signals. Only the first category, whose standard modalities are facial or body 
expressions (such as gestures and movements) and speech will be discussed. 
After this first step, a pre-processing of these signals is needed to remove or 
decrease the noise, that can be given, for example, by artifacts and, consequently, 


1 http: //help.sentiment140.com/for-students. 
? http://mturk.com. 
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increase the Ratio between Signal and Noise (SNR). Other pre-processing tasks 
are filtering and segmentation, concerning events or stimuli. A feature selection 
task can be applied to perform the analysis only on a reduced feature set. 

Depending on the type of analysis to be performed, different features can be 
considered, for example, time (e.g., statistical analysis), frequency (e.g., Fourier 
analysis), time-frequency (e.g., wavelets), or power domain (e.g., periodogram 
and autoregression). 


3.8 Emotion Recognition from Facial Expression 


The foundational study in [32] reported that the 55% of the communication is 
visual. Therefore, expressions and body gestures are considered the most obvious 
and significant channels to infer affect. Ekman's theory of basic emotion [11] is 
the dominant emotion theory to classify facial expressions. 

Each of the basic emotions is characterized by a series of muscular move- 
ments, formalized by what is called the Facial Action Coding System (FACS) 
and reported in Table2. The Facial Action Code System (FACS) was published 
by Paul Ekman and Wallace Friesen in 1978 and, subsequently, updated in 1992 
and again in 2002 [33]. The FACS was widely used for experiments on the recog- 
nition of emotions by the computer, from human facial expressions. This sys- 
tem objectively measures the frequency and intensity of facial expressions and 
deduces what is called an action unit (AU). 

In order to provide a specific index for each type of movement and expres- 
sion, the FACS takes into consideration 44 fundamental units named by Ekman 
and Friesen “Action Units AU" which can give rise to more than 7000 possible 
combinations. The total number of classified movements or characteristics is 58, 
some of which are typically associated with a specific emotion, while others are 
not associated with any other specific emotion. 


Table 2. Description of facial expressions in relation to the six Ekman's basic emotions 
theory. 


Emotion | Description of facial expression 


Disgust | Curled nose, raised cheeks, raised upper lip, upper eyelid and lowered 
eyebrows, raised lower eyelid 


Anger Lowered eyebrows tending to join in the center, tensions in the upper and 
lower eyelids, pressed lips 


Joy Wrinkles around the eyes, upper lower eyelids and raised cheeks, corners 
of the mouth stretched upwards 


Sadness | Inner angles of the raised eyebrows, inner angles of the upper eyelids 
raised, corners of the mouth downwards 


Surprise | Eyebrows raised and bent upwards, eyes wide open, mouth open 


Fear Raised eyebrows that tend to unite, upper eyelid raised, lower eyelid 
stretched, mouth open and lips stretched outward 
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Some of the most important techniques for facial expression recognition are 
briefly described below: 


e Active Appearance Models (AAM) [34]: they are well-known algorithms for 
modeling deformable objects. The models decouple the shape and texture of 
objects, using a gradient-based model adaptation approach. The most pop- 
ular applications of AAM include recognition, tracking, segmentation and 
synthesis. 

e Active Shape Models (ASM) [35]: these are statistical models that adapt to 
the data or object of an image in a manner consistent with the training data 
provided. These models are mainly used to improve the automatic analysis 
of images in noisy or messy environments. 

e Muscle-based models [36]: these are models that consist of characteristic facial 
points corresponding to the facial muscles, for detecting the movement of 
facial components, such as the eyebrows, the eyes and the mouth, thus rec- 
ognizing facial expressions. 

e Constrained local 3D model (CLM-Z) [37] is a non-rigid face tracking model 
used to trace facial features in various poses, including both depth and inten- 
sity information. Non-rigid face tracking refers to points of interest in an 
image, such as the tip of the nose, the corners of the eyes and lips. The CLM- 
Z model can be described by the parameters p = [s, R, q, t], where s is a scale 
factor, R is the rotation of the object, t represents the 2D translation and q 
is the vector that describes the non-rigid variation of the q. 

e GAVAM (Generalized Adaptive View-Based Appearance Model) [38] is a 
probabilistic structure that combines dynamic or movement-based approaches 
to track the position and orientation of the head through video sequences and 
employs user-independent static approaches to detect the head position from 
an image. GAVAM is considered a real-time high-precision, user-independent 
algorithm for tracking the position of the head in real time. 


3.4 Emotion Recognition from Speech 


The analysis of expressive language consists in examining the paralinguistic char- 
acteristics, that is, the aspects of verbal and non-verbal communication, such as 
the tone of the voice and its intensity. The analysis can be conducted from differ- 
ent points of view, including signal processing, linguistics, psychoacoustics and 
speech recognition. 

A first type of analysis is based on the construction of voice production 
models that try to model speech (speech) considering the breathing mechanisms 
and the structure of the phonatory apparatus (primarily vocal cords, mouth and 
nose). A second approach involves the study of speech from the point of view 
of perception that analyzes how speech is perceived and processed by the ear 
and the brain. The first approach, specifically, seeks to model the production of 
the item using mathematical models of the vocal tract. For the formulation of 
mathematical models, the vocal tract is studied by analyzing images of the vocal 
part obtained by ultrasonography, digital radiography and magnetic resonance. 
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Variations in the breathing pattern, specific vocal cord shape factors can 
determine variations in prosodic parameters, such as duration, intensity, funda- 
mental frequency and spectral content of speech. Specifically, the fundamental 
frequency is the vibration speed of the vocal cords and depends on the size and 
tension of the vocal cords at a given instant of time. It can change in relation to 
stress, emotion and level of intonation. 

In [39] and [40] the most relevant features for the recognition of emotions like 
the pitch contour, the energy of speech signals, and features related to spectral 
content are described. At the linguistic level, the analysis for the recognition of 
emotions involves the identification of the intonation of sentences, analysis of 
effort and accent in the pronunciation of words and sentences. 


3.4.1 Existing Databases of Emotional Speech 


e EMODB*: The Berlin Database of Emotional Speech (EMODB) is a public 
German speech database that incorporates audio files with seven emotions: 
happiness, sadness, anger, fear, disgust, boredom, and neutral [41]. 

e SAVEE?: The Surrey Audio-Visual Expressed Emotion (SAVEE) is a public 
British English speech database that has audio files with seven emotion labels: 
happiness, sadness, anger, fear, disgust, surprise, and neutral [42]. 

e EMOVO?: is a public Italian speech database that includes audio files with 
seven emotion labels: happiness, sadness, anger, fear, disgust, surprise, and 
neutral [43]. 


4 Deep Learning Algorithms for Emotion Detection 


The performance of the emotion extraction approaches presented so far is 
strongly related to how data are represented. In this sense, an essential step 
is feature engineering, i.e., the process that uses domain knowledge to design a 
good representation of data in terms of suitable features. By taking advantage of 
the large training dataset, Deep Learning Algorithm seeks to learn data repre- 
sentation along with the mapping that associates each input representation to its 
output. Moreover, Deep Neural Networks (DNNs) are Artificial Neural Networks 
designed to have different levels of non-linear and nested operations, that have 
shown to improve non-linear model tasks. The previous statements are some 
of the key points explaining the popularity of Deep Neural Networks (DNNs) 
and why they have increasingly been implemented also to face the problem 
of emotion recognition, by achieving state-of-the-art results for a wide range of 
tasks [44,45]. Several DNNs architectures have been proposed both in Sentiment 
Analysis and Affective Computing, for example, Convolutional Neural Networks 
(CNN) [46,47], Recurrent Neural Networks (RNNs) with or without attention 
mechanism, Autoencoders [48] and also Deep Belief Networks (DBN) [48-50]. 


3 http:/ /emodb.bilderbar.info/start.html. 
* http:/ /personal.ee.surrey.ac.uk/Personal/P.Jackson/S AVEE/. 
5 http://voice.fub.it/activities/corpora/emovo/index.html. 
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Both in Sentiment Analysis and Affective Computing, a key role is played in 
capturing long-term dependencies intended for example as extracting relations 
among distant words in a sentence but also as capturing temporal variations in 
facial or vocal expressions. To address this problem, a particular set of RNNs 
model is typically used in a practical application and are called gated recurrent 
RNNs, in particular, Long Short Term Memory (LSTM) network [51]. For what 
concerns textual emotion recognition, LSTMs models are the state-of-the-art 
algorithm in time series predictions features, for example monitoring systems 
[52]. In Face emotion recognition, Gated RNNs have been combined with CNN 
to improve sequential images modelling [53]. 


5 Challenges and Tools for Multimodal Emotion 
Recognition 


Recently, the scientific community has been increasing efforts for the joint appli- 
cation of Sentiment Analysis and Affective Computing techniques to create 
multi-modal systems, especially for monitoring and preventing mental health. 

For example, in [54], a system that combines Sentiment Analysis and Affec- 
tive Computing techniques to assess a subject’s mental health is presented. In 
particular, the authors proposed the use of embedded sensors in mobile devices 
(such as laptops and smartphones) to trace head and eye movements, facial 
expressions as well as heartbeat. Among the features useful to verify interactions 
among users, the speed of typing, the number of clicks and mouse movements, 
etc. were considered, starting from the assumption that a positive or negative 
mood has effects on the different degrees of activity of the user. Lastly, the mon- 
itoring of sentiment associated with the text, related to user posts on Twitter, 
was performed using a free tool for sentiment analysis, i.e., Sentiment 1409 and 
also a prediction algorithm for images posted along with tweets was used. 

'The major challenges for developing an integrated system, especially in com- 
bining data from integrated daily devices, can be identified in the following: 


1. input data should be appropriate to the type of analysis to be made. For exam- 
ple, even if smartphones allow videos with good quality, the facial expression 
recognition process requires high-resolution images; 

2. the choice of appropriate pre-processing, feature extraction and analysis tech- 
niques to achieve good performance is mandatory; 

3. the selection of the most suitable approach for integrating information 
extracted from multiple sources in the system is crucial. 


Concerning the last point, commonly used approaches are the following: 


e Fusion at feature-level: after a first phase of pre-processing of data extracted 
from different sources, all features are considered as different components of 
a joined feature vector, and then classification is performed accordingly. 


8 http:/ /www.sentiment140.com/. 
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e Fusion at decision level: instead of combining features in a single vector fea- 
tures as in feature-level fusion, a separate classifier for each modality is used. 
The output of each classifier was treated as a classification score. 


In [55], both the previous approaches were tested for integrating facial expres- 
sion, speech, and textual data to build a multi-modal sentiment analysis frame- 
work. The experimental results show that the accuracy of fusion at the fea- 
ture level is higher than the accuracy of fusion at decision-level. Accurately, the 
authors reported a precision value of 78.2% and a recall value of 77.1% for tests 
relative to feature-level fusion, whereas they referred a precision value of 75.2% 
and a recall value of 73.496 for decision level fusion. Another remarkable point 
is that, regardless of the fusion techniques, the results show how the simulta- 
neous use of video, text and audio modalities allows achieving better accuracy 
than when only pairs of the three patterns are considered. Considering the app- 
roach based on fusion at features level, the precision values of experiments are: 
(i) 72.4596 by using only visual and text-based features, (ii) 73.21% by using 
visual and audio-based features and (iii) 71.1596 by using audio and text-based 
features. 


5.1 Existing Multimodal Dataset for Emotion Recognition 


e SEMAINE Database. This dataset was developed in 2007 by McKeown et 
al. [56]. It is a large audiovisual database created for building agents capa- 
ble of involving a person in a prolonged and emotional conversation using a 
Sensitive Artificial Listener (SAL) [57] paradigm. SAL is an interaction that 
involves two parts: a ‘man’ and an ‘operator’ (a machine or a person who 
simulates a machine). There were 150 participants, 959 conversations, each 
lasting 5 min. For the recordings, participants were asked to speak in turn to 
four emotionally stereotyped characters. The characters are Prudence, which 
is balanced and sensitive; Poppy, who is happy and outgoing; Spike, who is 
angry and in conflict; and Obadiah, who is sad and depressive. 

e Interactive emotional dyadic acquisition database (IEMOCAP). The IEMO- 
CAP dataset was developed in 2008 by Busso et al. [58]. 10 actors were asked 
to record their facial expressions in front of the cameras. In particular, the 
dataset contains a total of 10h of recording, each of which expresses one of 
the following emotions: happiness, anger, sadness, frustration and a neutral 
state. 

e eNTERFACE. This dataset was developed in 2006 by Martin et al. [59] and 
contains audio and video for the evaluation of algorithms for the recognition of 
emotions from audio and video. The emotions labeled are: happiness, sadness, 
surprise, anger, disgust and fear. 

e CK++ dataset: the Cohn Kanade dataset contains facial images of 210 adults. 
The participants are 18-50 years old, 81% Americans, 13% Afro Americans 
and 696 of other ethnic groups; 6996 females. Participants are asked to perform 
23 facial expressions. 
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e Belfast Database. This data set was developed in 2000 by Douglas-Cowie et al. 
[57]. The database consists of audiovisual data of people discussing emotional 
issues and are taken from television chat programs and religious programs. 
Includes 100 speakers and 239 clips, with 1 neutral clip and 1 emotional 
clip for each speaker. Two types of descriptors were provided for each clip: 
dimensional and categorical, according to the different emotion approaches. 


6 Conclusions 


This article presented an overview of the existing approaches for extracting senti- 
ment and emotion from different input modalities through the use of Sentiment 
Analysis and Affective Computing techniques. In particular, audio, video and 
textual data were considered and, for each input modality a pipeline of analysis, 
existing datasets and tools were presented. Deep Learning approaches were also 
considered and discussed. Subsequently, recent efforts and challenges to combine 
these different unimodal approaches toward multimodal systems were reported. 
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